CHAPTER 1 Introduction


1.1 Introduction 


This document supplements the MIPS64 Specification by MIPS Technologies Incorporated (MTI). The user should [...] implementation of the SB-1 processor. The purpose of this document is to [...] features listed in the MIPS64 Specification document. [...] the SB-1 processor.


1.2 Document Organization 


This document is organized as follows: 


Chapter 2 provides a general block diagram for the overall CPU and covers the functionality of the basic blocks. 
Chapter 3 presents a general description of the ALU block together with an overview of the integer and load/
store instructions and their latencies. 


Chapter 4 delves into floating point architecture specifics and covers FP instructions, latencies, and restrictions 
for CP1 category of instructions. This chapter also covers MIPS-3D Application Specific Extension (ASE) 
category of instructions. 


Chapter 5 deals with MIPS MDMX ASE and related issues. 


Chapter 6 provides a basic description of SB-1 supported memory hierarchy, Caches, TLBs, Cache Operations, 
and Cache Coherency Attributes. 


Chapter 7 covers the MIPS64 Address Space as implemented by SB-1 core. 
Chapter 8 specifies a complete listing of the CP0 registers supported in the SB-1 core.


Chapter 9 details the debug architecture for SB-1. SiByte-specific enhancements to the standard MIPS64 Debug 
architecture are described here.


Chapter 10 describes the Error handling capabilities of SB-1. 


Chapter 11 addresses the performance monitoring architecture supported by SB-1. A complete listing of these 
features and their usage through CP0 architecture space registers is described in this chapter.


Chapter 12 provides a high level overview of the multiprocessing features supported by SB-1. Example code 
segments featuring the MP capabilities of the processor are provided here. 


Chapter 13 provides a list of all implementation-specific features in the MIPS64 Specification and provides the 
SB-1 resolutions of these features. 
1.3 Additional Documentation


The following documents are required as supplements to this specification.


TABLE 1-1 Supplemental Documents to SB-1 Users Manual


MIPS64 Specification, Revision 1.0 | Consolidated MIPS I, II, III, IV, and V ISA Specifications with a new Privileged Resource Architecture based on MIPS R4000 Processor | MIPS Technologies Incorporated
MIPS-3D ASE Specification, Revision 1.0 | Describes 3D enhancements to the basic MIPS64 Architecture | MIPS Technologies Incorporated
MDMX Version 2.0 Specification, Revision 0.3.2 | Describes MIPS Multimedia Extensions to the basic MIPS64 Architecture | MIPS Technologies Incorporated
MIPS RISC Architecture Volume I | Describes MIPS basic instructions in detail | MIPS Technologies Incorporated
MIPS RISC Architecture Volume II | Describes MIPS basic instructions in detail | MIPS Technologies Incorporated
MIPS Extended JTAG (EJTAG), Version 2.5 | Describes MIPS EJTAG Specification | MIPS Technologies Incorporated


1.4 What is Missing or Incomplete in this Version of the Document? 


The following chapters need additional work. A future revision of this document will provide further details.

• Chapter 4: The FPU (CP1) and MIPS-3D ASE Instructions. The exception-specific implementation details for the FPU need further elaboration.

• Chapter 5: The MDMX ASE Instructions. The implementation details of this unit are not finalized and are subject to change from their current description in Chapter 5.

• Chapter 11: The Performance Monitor Architecture. The implementation details of this unit are not finalized and are subject to change from their current description in Chapter 11.
CHAPTER 2 SB-1 Overview


2.1 Introduction 


This chapter elaborates on high level features supported by the SB-1 core. Further detail on specific SB-1 functionality is provided in subsequent chapters.


2.2 High Level Features 


TABLE 2-1 SB-1 High Level Specification


FEATURE | SPECIFICATION
Instruction Set Architecture (ISA) | MIPS64 with SIMD Floating Point Functionality; MIPS-3D ASE; MDMX ASE
Pipe Architecture | Dual Enhanced-Skew Execute; Dual Memory; Support for 0-cycle load-to-use instruction sequences
Split I and D (Harvard Architecture)
Instruction | 32K, 4-way, 32-Byte Lines, LRU Replacement Policy
Data | 32K, 4-way, Non-blocking, 32-Byte Lines, LRU Replacement Policy
3x Structures
Direction Predictor | 2-Level GShare, 4K-entry x 2-bit
Jump Register Cache | 64-entry
Return Stack | 8-entry

a. MADD instruction is counted as 2 distinct operations: one multiply and one add


Figure 2-1 shows a simplified block diagram for SB-1 core. Subsequent sections provide additional detail with 
regard to the internals of SB-1.
[Figure 2-1 block labels recovered from the diagram: Fetch/Decode/Issue/Control; 4x Issue, In-order; 2 EXE, 2 Ld/St; DirPred 4Kx2b, Total: 2/Cycle; RetStack 8-Entry; Skewed Pipes E0 (Lat = 1 cycle); Skewed Pipes F0/A0 and F1/A1 (Lat = 4/2 cycles); ICache, Line = 32B; Double Pumped DCache, Line = 32B; LS1; 64x2-Entry]


FIGURE 2-1 Simplified Block Diagram of SB-1
2.3 SB-1 Units


Internally, the SB-1 core comprises the PC Unit, the Issue Unit, the Execute Unit, the Load/Store Unit, the 
Floating Point Unit, the MDMX Unit, the Memory Unit, the Bus Interface Unit, and the Level One Instruction 
and Data Caches. The following sections briefly describe the functionality of each unit.


2.3.1 The PC Unit 


The PC Unit performs the sequencing of the instruction fetch together with completing instructions and detecting 
exceptions. These functions are implemented via two subunits: the Fetch Unit and the Graduation Unit. The 
responsibility of the Fetch Unit is to predict the program flow and qualify fetched instructions. The Graduation 
Unit ensures that the instructions modify architected state in program order in light of branch mispredicts and 
instruction exceptions.


2.3.1.1 The Branch Unit 


As shown in Figure 2-1, SB-1 supports three unique structures to aid program control flow in three distinct areas. 


2.3.1.1.1 Two Level GShare, Branch Direction Predictor 


This structure, with 4K x 2-bit entries, works in conjunction with a 9-bit Branch History Register (BHR). The
contents of the BHR are Exclusive-ORed with 11 PC bits to provide an index into the direction predictor table, which
uses the 2-bit counter scheme to predict the branch outcome (taken vs. not-taken). There can be up to 2 predictions
per cycle.
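
The indexing scheme described above can be sketched as a small software model. This is only an illustration under stated assumptions: the text does not say which 11 PC bits are used, how the 9-bit BHR is aligned within the index, what the counters reset to, or how an 11-bit index addresses 4K entries, so the choices below (PC bits [12:2], BHR XORed into the low-order bits, counters reset to weakly not-taken, a 2^11-entry table) and all names are hypothetical.

```python
class GsharePredictor:
    """Toy gshare direction predictor built from 2-bit saturating counters."""

    def __init__(self, index_bits=11, history_bits=9):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.bhr = 0  # Branch History Register, newest outcome in bit 0
        # Assumed reset state: every counter starts at 1 (weakly not-taken).
        self.counters = [1] * (1 << index_bits)

    def index(self, pc):
        # Assumption: the 11 PC bits are bits [12:2] of the branch address,
        # and the 9-bit BHR is XORed into the low-order bits of that field.
        return ((pc >> 2) & self.index_mask) ^ self.bhr

    def predict(self, pc):
        # Counter values 2 and 3 predict taken; 0 and 1 predict not-taken.
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the newest outcome into the history register.
        self.bhr = ((self.bhr << 1) | int(taken)) & self.history_mask


predictor = GsharePredictor()
loop_branch_pc = 0x1000
first_guess = predictor.predict(loop_branch_pc)
for _ in range(20):
    predictor.update(loop_branch_pc, True)
trained_guess = predictor.predict(loop_branch_pc)
```

In this model an always-taken branch is first predicted not-taken and is predicted taken once the history register has filled with taken outcomes and the counter it then selects has been trained.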


2.3.1.1.2 Return Stack (RS) 


The eight-entry, processor-based Return Stack provides a mechanism to predict return addresses for subroutine
calls.
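
A return stack of this size can be modeled in a few lines. The sketch below is illustrative only: the overflow policy (dropping the oldest entry) and the underflow behavior (no prediction) are assumptions, since the text gives only the entry count, and the class and method names are hypothetical. The link address of PC + 8 reflects the MIPS branch delay slot.

```python
from collections import deque


class ReturnStack:
    """Toy 8-entry return address predictor."""

    def __init__(self, entries=8):
        # Assumption: on overflow the oldest entry is silently discarded,
        # which is what deque(maxlen=...) does on append.
        self.stack = deque(maxlen=entries)

    def on_call(self, call_pc):
        # MIPS JAL/JALR link to the instruction after the delay slot.
        self.stack.append(call_pc + 8)

    def on_return(self):
        # Assumption: an empty stack yields no prediction (None).
        return self.stack.pop() if self.stack else None


rs = ReturnStack()
for depth in range(10):          # nest 10 calls into an 8-entry stack
    rs.on_call(0x1000 + 0x100 * depth)
predicted = [rs.on_return() for _ in range(9)]
```

With ten nested calls, the eight innermost returns are predicted and the remaining ones are not, which shows why call depth beyond the stack size costs prediction accuracy.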


2.3.1.1.3 Jump Register Cache (JRC)


The Jump Register Cache is used to accelerate the execution of indirect branches through registers. SB-1 
supports a 64-entry JRC. It provides a prediction mechanism for the target of indirect jumps through registers. 


2.3.2 The Issue Unit (Box) 


The Issue Unit is responsible for issuing instructions to various functional units, and for tracking their progress 
until they can be handed off to the graduation part of the PC Unit. This unit examines and keeps track of all 
structural hazards as well as data dependencies in order to issue up to 4 instructions per cycle. 
SB-1 supports decoupled front and back ends; the machine can continue processing instructions on instruction
cache misses. An instruction queue (IQ) buffers instructions as they are fetched from memory. 


2.3.3 The Execute Unit (E0, E1) 


The Execute Unit is responsible for execution of ALU, Shift, and Branch instructions in the MIPS64 ISA. This 
unit supports thirty-two 64-bit Integer registers and 64-bit HI/LO registers for multiplies. The execute unit 
supports dual 8-stage, fully pipelined, 1-cycle execution latency pipes with enhanced skewing to allow
zero-cycle load-to-use sequences. Multiply and divide instructions take additional cycles to complete.


2.3.4 The Load/Store Unit (LS0, LS1) 


The Load/Store Unit executes memory load and store operations supported by the MIPS64 ISA. The load/store 
unit supports dual 8-stage load/store pipelines with the ability to execute simple ALU instructions in one pipe. 
This reduces ALU to address generation penalty for load/store address computations. 


2.3.5 The Floating Point Unit (F0, F1) 
The Floating Point Unit executes MIPS64 floating point and MIPS-3D ASE categories of instructions. It is 
IEEE-754 compatible and has support for Single, Double, and Paired-Single data formats.


There are thirty-two 64-bit Floating Point Registers in the FP Unit. The unit supports dual 11-stage, fully 
pipelined, 4-cycle execution latency pipes with enhanced skewing to allow zero-cycle load-to-use sequences. 


2.3.6 The MDMX Unit (A0, A1)


The MDMX unit implements MIPS MDMX instructions using the same registers as the Floating Point Unit. It
supports thirty-two 64-bit Floating Point Registers. The unit supports dual 9-stage, fully pipelined, 2-cycle 
execution latency pipes with enhanced skewing to allow zero-cycle load-to-use sequences. 


The MDMX unit has extended accumulator support, with 24 and 48-bit modes for 8 and 16-bit SIMD 
computations, respectively. 


2.3.7 The Memory Unit (MBox) 


The Memory Unit implements the memory management functionality, as outlined in the MIPS64 Privileged 
Resource Architecture. In particular, it supports Coprocessor 0 (CP0) functionality of TLB, CACHE, and SYNC
category of instructions. 
2.3.8 The Bus Interface Unit (BIU)


This unit provides the interface between the core and the external bus. 


2.3.9 Level One Instruction and Data Caches 


SB-1 supports a 32KB, 4-way set associative, virtually-indexed and virtually-tagged instruction cache and a 
32KB, 4-way set associative, physically-indexed and physically-tagged data cache. This provides the processor 
with a sizable portion of fast, on-chip memory. 


SB-1 has a non-blocking data cache with support for up to 8 outstanding cachelines. 
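
The cache parameters quoted above fix the number of sets and the address bits used for indexing. The following sketch works through that arithmetic; the function name and dictionary keys are hypothetical and only serve the illustration.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive set count and address field widths for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 of the line size
        "index_bits": sets.bit_length() - 1,         # log2 of the set count
        "way_bytes": size_bytes // ways,             # bytes covered by one way
    }


# Both SB-1 level one caches: 32KB, 4-way, 32-byte lines.
l1 = cache_geometry(32 * 1024, ways=4, line_bytes=32)
```

For both caches this gives 256 sets, a 5-bit line offset, an 8-bit set index, and 8KB per way.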


2.4 SB-1 Specifics


The remaining chapters in this document provide further details on the specifics of the major units in SB-1 core. 
CHAPTER 3 The CPU Instructions


3.1 Introduction 


This chapter provides a general overview of MIPS64 CPU instructions, as supported in the SB-1 implementation. The information provided here should be regarded as a supplement to the MIPS64 Specification document.


3.2 List of Instructions 


Table 3-1 through Table 3-9 provide the list of CPU category instructions supported in SB-1.


3.2.1 CPU Load, Store, and Memory Control Instructions


TABLE 3-1 CPU Load, Store, and Memory Control Instructions 


Mnemonic|Instruction 


3.2.2 CPU Arithmetic Instructions 


TABLE 3-2 CPU Arithmetic Instructions 


Mnemonic|Instruction 
DIV|Divide Word
DIVU|Divide Unsigned Word
MULT|Multiply Word
MULTU|Multiply Unsigned Word
DSUBU|Subtract Unsigned Doubleword


3.2.3 CPU Logical Instructions 


TABLE 3-3 CPU Logical Instructions


Mnemonic|Instruction
AND|Logical AND
OR|Logical OR
ORI|Logical OR Immediate
XOR|Logical XOR
XORI|Logical XOR Immediate


3.2.4 CPU Move Instructions 


TABLE 3-4 CPU Move Instructions 


MFHI Move from HI 
MFLO Move from LO 


3.2.5 CPU Shift Instructions


TABLE 3-5 CPU Shift Instructions


Mnemonic|Instruction
DSRA32|Shift Doubleword Right Arithmetic +32
DSRAV|Shift Doubleword Right Arithmetic Variable
DSRL|Shift Doubleword Right Logical
DSRL32|Shift Doubleword Right Logical +32
DSRLV|Shift Doubleword Right Logical Variable


3.2.6 CPU Branch and Jump Instructions 


TABLE 3-6 CPU Branch and Jump Instructions


Mnemonic|Instruction
BEQ|Branch on Equal
BGEZ|Branch on Greater Than or Equal to Zero
BGTZ|Branch on Greater Than Zero
BLEZ|Branch on Less Than or Equal to Zero
BLTZ|Branch on Less Than Zero
JALR|Jump and Link Register
JR|Jump Register


3.2.7 CPU Trap Instructions 


TABLE 3-7 CPU Trap Instructions


Mnemonic|Instruction
BREAK|Breakpoint
SYSCALL|System Call
TEQ|Trap if Equal


3.2.8 Obsolete Branch Instructions 


Software is strongly encouraged to avoid use of the Branch Likely instructions, as they will be removed from a 
future revision of the MIPS64 Architecture. 


TABLE 3-8 Obsolete Branch Instructions


Mnemonic|Instruction
BLTZALL|Branch on Less Than Zero and Link Likely
BLTZL|Branch on Less Than Zero Likely
BNEL|Branch on Not Equal Likely
3.2.9 Embedded Application Instructions


TABLE 3-9 Embedded Application Instructions 


3.3 Block and Pipeline Diagrams 


The CPU block of SB-1 consists of an Execute Unit (EXE) and a Load/Store Unit (LS). Figure 3-1 shows the
pipes in each unit.


LS Unit 


FIGURE 3-1 EXE and LS Pipes in SB-1 
3.3.1 The EXE0 Unit


Figure 3-2 shows a block diagram of instruction execution flow in the EXE0 pipe.


[Figure 3-2 block labels: PC, Predicted PC, RT/Imm, RS, Branch Eval, Mispredict, Target PC, Output to Reg File]


FIGURE 3-2 EXE0 Block Diagram


The list of instructions supported by the EXE0 Unit is shown in Table 3-10.


TABLE 3-10 Instructions Supported by the EXE0 Unit


Conditional Moves 
MOVT, MOVF 
MOVZ, MOVN 


The pipeline diagram for the EXE0 unit is shown in Table 3-11.


TABLE 3-11 EXE0 Pipe Stages in SB-1


Branch Ops | Fetch | Decode | Issue | Skew1 | Skew2 | Read RF | Check Prediction -- signal redirect if mispredicted; Compute target and link addresses | Write Link Address to RF
CP1 Branch Ops | Fetch | Decode | Issue | Skew1 | Skew2 | Read RF | Check Prediction -- signal redirect if mispredicted; Compute target and link addresses | Write Link Address to RF
MOVF, MOVT | Fetch | Decode | Issue | Skew1 | Skew2 | Read FP CCs (a) | Evaluate Condition
MOVZ, MOVN | Fetch | Decode | Issue | Skew1 | Skew2 | Read RF | Compare rt to zero, Signal result to FP Unit


a. FP Condition Codes are not bypassed


The skewed slots in pipe stages 1 and 2 allow the coissuing of load/store and dependent EXE instructions in the 
same cycle. 
3.3.2 The EXE1 Unit
The block diagram for EXE1 Unit is shown in Figure 3-3. 


Output to Reg File 


FIGURE 3-3 EXE1 Block Diagram 
The list of instructions supported by the EXE1 Unit is shown in Table 3-12.


TABLE 3-12 Instructions Supported by the EXE1 Unit 


List of Instructions supported by EXE1 Unit 
ADDs, SUBs, Logical Ops 
Conditional Moves


Multiplies 
The pipeline diagram for EXE1 unit is shown in Table 3-13.


TABLE 3-13 EXE1 Pipe Stages in SB-1 (All Except Divide)


Read RF | Execute1 | Execute2 | Execute3 | Write HI/LO
Execute2 | Execute3 | Execute4; Write LO | Write HI
Skew1 | Skew2 | Read RF | Execute1 | Execute2 | Read HI/LO; Execute3 | Write HI/LO
Issue | Skew1 | Skew2 | Read HI/LO | Write RF
Skew1 | Skew2 | Read FP CCs (a) | Evaluate Condition | Write RF
Skew1 | Skew2 | Read RF | Compare rt to zero, Signal result to FP Unit


a. FP Condition Codes are not bypassed
As in EXE0, the skewed slots in pipe stages 1 and 2 allow the coissuing of load/store and dependent EXE
instructions in the same cycle. 


Table 3-14 shows the pipeline stages for integer divide operations supported in EXE1 unit. 


TABLE 3-14 EXE1 Pipe Stages in SB-1 for Divide Instructions 


Execute1 | Deassert div_busy
Execute1 | Deassert div_busy


3.3.3 The LS0 Unit
The block diagram for the LS0 Unit is shown in Figure 3-4.


[Figure 3-4 block labels: Address to TLB, Address Error Logic, Address Error Exc]


FIGURE 3-4 LS0 Unit Block Diagram


The list of instructions supported by the LS0 Unit is shown in Table 3-15.


TABLE 3-15 Instructions Supported by the LS0 Unit


List of Instructions supported by LS0 Unit


Integer and Floating Point Loads and Stores 


The pipeline diagram for the LS0 unit is shown in Table 3-16. The diagram applies to both integer and floating point
loads and stores.


TABLE 3-16 LS0 Pipe Stages in SB-1


Loads | Fetch | Decode | Read RF | Compute Address; Access TLB | Cache Tag Lookup, Cache Data Read and Way Select | Write RF
Stores | Fetch | Decode | Read RF | Compute Address; Access TLB | PA Pushed into DCFIFO | Data Pushed into DCFIFO | Cache accessed after graduation of store instruction and availability of free slot in DCache
3.3.4 The LS1 Unit
The block diagram for LS1 Unit is shown in Figure 3-5. 


[Figure 3-5 block labels: Address to TLB, Address Error Logic, Address Error Exc, CP0 Driver, CP0 Bus, Output to Reg File]


FIGURE 3-5 LS1 Block Diagram
The list of instructions supported by the LS1 Unit is shown in Table 3-17.


TABLE 3-17 Instructions Supported by the LS1 Unit


List of Instructions supported by LS1 Unit
ADDs, SUBs, Logical Ops
LUI
Loads and Stores
Indexed Loads/Stores
TLB Ops
MT/MF CP0
Cache Ops


The pipeline diagram for the LS1 unit is shown in Table 3-18.


TABLE 3-18 LS1 Pipe Stages in SB-1


Loads | Fetch | Decode | Read RF | Compute Address
MF C0 | Fetch | Decode | Drive Control | Drive CP0 Bus | Write RF
Stores | Fetch | Decode | Read RF | Compute Address | Data Pushed into DCFIFO | Cache accessed after graduation of store instruction and availability of free slot in DCache
MT C0 | Fetch | Decode | Read RF | Drive Control | Drive CP0 Bus | Write RF
Cache Ops | Fetch | Decode | Read RF | Compute Address
3.4 Instruction Latency and Throughput by Category of Instructions


Table 3-19 shows the latency and throughput by category for all instructions supported in the EXE and LS Units. 


TABLE 3-19 Instruction Throughput and Latency for EXE and LS Units by Inst Category


Latency | Throughput (1 Instruction/x cycles per supported pipe) | Co-issue w/ Dependent Op?


3.5 Available Bypasses


Table 3-20 shows the available bypasses among EX0, EX1, LS0 and LS1 units.


TABLE 3-20 List of Available Bypasses in SB-1 Core for EX0, EX1, LS0, and LS1 Units
3.6 Instruction Types Issued to each Pipe


Table 3-21 summarizes the types of instructions that can be issued to each one of EX0, EX1, LS0, and LS1 pipes.


TABLE 3-21 Instruction Types Issued to each Pipe


Instruction Type | EX0 Pipe | EX1 Pipe | LS0 Pipe | LS1 Pipe
3.7 Issue Rules and Restrictions


Table 3-22 identifies issue rules and restrictions for EXE and LS pipes. These restrictions are enforced by 
hardware interlocks. 


TABLE 3-22 Instruction Issue Rules and Restrictions for CPU instructions


Instruction A | Instruction B | Restrictions
MUL | Any dependent op to EX0 or EX1 | Dependent op can issue 3 cycles after MUL
MUL | Any dependent op to LS0 or LS1 | Dependent op can issue 8 cycles after the MUL
 | MFLO, MFHI | MF* instruction can issue two cycles after the multiply
DMULT | MFLO | MFLO can issue two cycles after DMULT
DMULT | MFHI | MFHI can issue three cycles after DMULT
DMULT | Any Multiply | No multiply instructions may be issued in the cycle immediately following a DMULT
 | DIV, DDIV | Another divide cannot be issued while there is a divide in the pipe
 | MFLO, MFHI | HI/LO reads cannot be issued while there is a divide in the pipe
 | Any multiply except MUL | Multiplies that write HI/LO cannot be issued while there is a divide in the pipe
Shift or ALU op to EXE pipes | Any dependent LS op | Dependent op can issue 4 cycles after the shift
 | Dependent LS op | Dependent op can issue the next cycle after the ALU op; cannot co-issue
 | Dependent LS op | Dependent op can issue 4 cycles after the LD/ST
3.8 Differences between 32 and 64-bit Modes of Operation


EXE Ops: In 32-bit mode, 64-bit instructions cause reserved instruction exceptions.


LD/ST Ops: Address errors are generated based on the mode as specified in the MIPS64 Specification.
CHAPTER 4 The FPU (CP1) and MIPS-3D ASE Instructions


4.1 Introduction 


This chapter provides a general overview of MIPS64 floating point and MIPS-3D ASE (Application Specific Extension) instructions, as supported in SB-1. The information provided here should be regarded as a supplement to the MIPS64 Specification and MIPS-3D ASE documents.
TABLE 4-1 FP Block Description


Maximum Number of FP Operations per Cycle per Pipe | 4 Single Precision FP Operations (2 Multiply Adds on Paired Single Operands per One Instruction)
Maximum Number of FP Operations per Cycle in SB-1 | 8 Single Precision FP Operations


4.3 Block and Pipeline Diagrams 


Figure 4-1 shows a high level block diagram of the Floating Point Unit in SB-1. 


(Approximate) 


FP1 Pipe    FP0 Pipe


FIGURE 4-1 Block Diagram of the Floating Point Unit 
Table 4-2 shows the Floating Point pipe diagram for the FP0 and FP1 pipes.


TABLE 4-2 FP0 and FP1 Pipe Operation


4.4 Instruction Latency and Throughput by Category of Instructions 


The following tables present the list of supported instructions in SB-1 Floating Point Unit and their associated 
latencies. 


TABLE 4-3 FPU Load/Store Instructions Supported in CPU Unit (Chapter 3) 


All instructions in Table 4-3 can be issued to the LS1 pipe. The first four instructions can additionally be issued to
the LS0 pipe.


TABLE 4-4 FPU Arithmetic Instructions


Instruction | Supported Data Formats | Latency | Throughput (1 Instruction/x cycles per supported pipe)
MUL
MADD | Single, Double, Paired Single | 8/8/8 | 1
MSUB | Single, Double, Paired Single | 8/8/8 | 1
NMADD | Single, Double, Paired Single | 8/8/8 | 1
NMSUB | Single, Double, Paired Single | 8/8/8 | 1
SQRT | Single, Double, Paired Single | 28/40/28 | 4 Insts/28 cycles (S), 4 Insts/40 cycles (D), 4 Insts/28 cycles (PS)


TABLE 4-5 FPU Move Instructions 


Supported Data Formats | Throughput (1 Instruction/x cycles per supported pipe)
Single, Double | 1
TABLE 4-6 FPU Convert Instructions


CVT.D | Single, Word, Long | 4
CEIL.W | Single, Double
CEIL.L | Single, Double
FLOOR.W | Single, Double
TABLE 4-7 FPU Branch Instructions


Throughput (1 Instruction/x cycles per supported pipe)
Refer to Chapter 3
Refer to Chapter 3
TABLE 4-8 Obsolete (a) FPU Branch Instructions


Supported Data Formats
BC1FL | Refer to Chapter 3
BC1TL | Refer to Chapter 3


a. Software is strongly encouraged to avoid use of the Branch Likely instructions, as they will be removed from a future revision of the MIPS64 architecture.


4.5 MIPS-3D ASE Instructions 


Table 4-9 lists the MIPS-3D ASE instructions supported in SB-1. The execution of these instructions is 
supported through the Floating Point Unit. 


TABLE 4-9 MIPS-3D Instructions in the SB-1 Core 


Throughput (1 Instruction/x cycles per supported pipe) 
MULR 
a. RECIP2 is implemented as nmsub fd, 1, fs, ft
4.6 Available Bypasses


Table 4-10 shows the available bypasses among Floating Point, Load/Store and Integer Register File units.


TABLE 4-10 List of Available Bypasses in SB-1 Core for EX0, EX1, LS0, and LS1 Units


From/To | FP0 | FP1


4.7 Differences between the Pipes 


Table 4-11 summarizes the types of instructions that can be issued to each one of the FP0 and FP1 floating point
pipes. The two pipes are symmetrical for most regular floating point operations, but the majority of MIPS-3D
instructions can be issued to the FP1 pipe only.


TABLE 4-11 Instruction Types Issued to each Pipe 


FP1 Pipe 
Yes 
4.8 Issue Rules and Restrictions


Table 4-12 identifies the issue rules and restrictions for floating point instructions. 


TABLE 4-12 Issue Rules and Restrictions for Floating Point Instructions 


Instruction A | Instruction B | Restrictions
All except below | All except below | Dependent op can issue 4 cycles after instruction
RECIP | Any dependent op | Dependent op can issue 9 cycles after for Single Precision and Paired Singles and 15 cycles after for Double Precision
RSQRT | Any dependent op | Dependent op can issue 12 cycles after for Single Precision or Paired Singles and 21 cycles after for Double Precision
DIV | Any dependent op | Dependent op can issue 18 cycles after for Single Precision and Paired Singles and 24 cycles after for Double Precision
SQRT | Any dependent op | Dependent op can issue 28 cycles after for Single Precision and Paired Singles and 40 cycles after for Double Precision
MADD, MSUB, NMADD, NMSUB, RECIP2, RSQRT2 | Any dependent op | Dependent op can issue 8 cycles after for Single Precision, Double Precision, and Paired Singles (unless accumulator of MADD, see Section 4.9.1)
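
When hand-scheduling floating point code, the issue distances in Table 4-12 can be kept in a small lookup helper. The sketch below simply transcribes the table values; the function and constant names are hypothetical, and the MADD accumulator special case from Section 4.9.1 is deliberately not modeled.

```python
# Issue distance (in cycles) before a dependent op may issue, per Table 4-12.
DEFAULT_DISTANCE = 4
LONG_LATENCY = {
    "RECIP": {"S": 9, "PS": 9, "D": 15},
    "RSQRT": {"S": 12, "PS": 12, "D": 21},
    "DIV": {"S": 18, "PS": 18, "D": 24},
    "SQRT": {"S": 28, "PS": 28, "D": 40},
}
MULTIPLY_ADD_CLASS = {"MADD", "MSUB", "NMADD", "NMSUB", "RECIP2", "RSQRT2"}


def dependent_issue_distance(op, fmt):
    """Return how many cycles after `op` a dependent FP op can issue."""
    op, fmt = op.upper(), fmt.upper()
    if fmt not in ("S", "D", "PS"):
        raise ValueError("fmt must be S, D, or PS")
    if op in LONG_LATENCY:
        return LONG_LATENCY[op][fmt]
    if op in MULTIPLY_ADD_CLASS:
        # Does not model the accumulator exception noted in the table.
        return 8
    return DEFAULT_DISTANCE


div_double = dependent_issue_distance("DIV", "D")
add_single = dependent_issue_distance("ADD", "S")
```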


4.9 Implementation Details on Special Instructions 


The next sections comment on SB-1 specific implementation details with regard to a few FP instructions. 


4.9.1 MADD, MSUB, NMADD, NMSUB 


OPERATION fd, fr, fs, ft; fd <- fs * ft +/- fr


This group of instructions is implemented as an IEEE rounded multiply followed by an IEEE rounded add, all 
with an 8-cycle latency. Operand fr is read 4 cycles after operands fs and ft. It can also be sourced from a 
bypass. These instructions behave like a separately issued MUL followed by an ADD. Exception flags of both 
MUL and ADD are ORed and stored in the FCSR. 
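
Because the multiply and the add are each rounded, the result can differ from an exactly computed fs * ft + fr. The sketch below illustrates this with double precision values in Python, whose floats are IEEE doubles on common platforms; it is a numerical illustration of two roundings only, not a model of the SB-1 datapath, and the function names are hypothetical.

```python
from fractions import Fraction


def madd_two_roundings(fr, fs, ft):
    """Multiply, round, then add, round: the behavior described above."""
    product = fs * ft      # first IEEE rounding
    return product + fr    # second IEEE rounding


def madd_exact(fr, fs, ft):
    """Exact rational value of fs * ft + fr, with no intermediate rounding."""
    return Fraction(fs) * Fraction(ft) + Fraction(fr)


fs = 1.0 + 2.0 ** -30
ft = 1.0 - 2.0 ** -30
fr = -1.0

rounded_twice = madd_two_roundings(fr, fs, ft)
exact = madd_exact(fr, fs, ft)
```

Here the exact product is 1 - 2^-60, which rounds to 1.0 before the add, so the two-rounding result is 0.0 while the exact value is -2^-60. Code that relies on the low-order bits of a multiply-add should account for this.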


An operation that accumulates the result of several multiplies is executed with 3 bubbles between subsequent
ops. To avoid the bubbles, it is recommended to process up to 4 different multiply-accumulate type operations in
parallel. An example follows:


MADD f0, f0, f1, f2        (f0 = f0 + f1 * f2)
3 bubbles (nops)
MADD f0, f0, f4, f5        (f0 = f0 + f4 * f5)
3 bubbles (nops)
MADD f0, f0, f6, f7        (f0 = f0 + f6 * f7)


The above sequence can be optimized by interleaving four independent streams as such:


MADD f0, f0, f1, f2        Stream 1
MADD f8, f8, f9, f10       Stream 2
MADD f16, f16, f17, f18    Stream 3
MADD f24, f24, f25, f26    Stream 4

MADD f0, f0, f4, f5        Stream 1
MADD f8, f8, f11, f12      Stream 2
MADD f16, f16, f19, f20    Stream 3
MADD f24, f24, f27, f28    Stream 4

MADD f0, f0, f6, f7        Stream 1
MADD f8, f8, f13, f14      Stream 2
MADD f16, f16, f21, f22    Stream 3
MADD f24, f24, f29, f30    Stream 4
These instructions are fully pipelined, i.e., each pipe can absorb a multiply-add type operation every cycle.
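
The benefit of interleaving can be estimated with a simple issue-slot count. The sketch below assumes a single pipe that issues at most one MADD per cycle, round-robin issue across streams, and a dependent accumulate that can issue 4 cycles after its predecessor (the 3 bubbles stated above); the function name and this simplified model are assumptions, not a cycle-accurate description of SB-1.

```python
def issue_cycles(num_streams, madds_per_stream, dep_distance=4):
    """Cycles needed to issue all MADDs on one pipe, one issue per cycle."""
    ready = [0] * num_streams  # earliest cycle each stream may issue next
    cycle = 0
    for _ in range(madds_per_stream):
        for s in range(num_streams):
            cycle = max(cycle, ready[s])      # wait out bubbles if needed
            ready[s] = cycle + dep_distance   # next accumulate in this stream
            cycle += 1                        # this issue slot is consumed
    return cycle


single_chain = issue_cycles(num_streams=1, madds_per_stream=12)
interleaved = issue_cycles(num_streams=4, madds_per_stream=3)
```

Under these assumptions, twelve MADDs in one dependent chain occupy 45 cycles of issue time, while the same twelve MADDs split across four streams issue in 12 consecutive cycles.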


4.9.2 DIV Operation 


DIV fd, fs, ft; fd <- fs / ft


In SB-1, this operation is implemented using RECIP.fmt, MUL.fmt, and a rounding step to obtain the correctly
rounded IEEE result. DIV operations with an ft exponent of 254 or 253 for single precision, or an ft exponent of
2047 or 2046 for double precision, are not implemented and will cause an unimplemented exception.
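
Software that wants to avoid the unimplemented exception can test the divisor before dividing. The sketch below assumes that the exponent values quoted above refer to the biased exponent field of the IEEE encoding, which the text does not state explicitly; the helper names are hypothetical.

```python
import struct

# Divisor exponent values that the text lists as unimplemented for DIV.
UNIMPLEMENTED_DIVISOR_EXPONENTS = {"S": {253, 254}, "D": {2046, 2047}}


def biased_exponent(value, fmt):
    """Extract the biased exponent field of an IEEE single or double."""
    if fmt == "S":
        bits = struct.unpack(">I", struct.pack(">f", value))[0]
        return (bits >> 23) & 0xFF
    if fmt == "D":
        bits = struct.unpack(">Q", struct.pack(">d", value))[0]
        return (bits >> 52) & 0x7FF
    raise ValueError("fmt must be S or D")


def div_is_unimplemented(ft, fmt):
    """True if dividing by ft would hit the listed exponent values."""
    # Assumption: the listed values are biased exponents of the divisor ft.
    return biased_exponent(ft, fmt) in UNIMPLEMENTED_DIVISOR_EXPONENTS[fmt]


large_single = div_is_unimplemented(2.0 ** 126, "S")
ordinary_double = div_is_unimplemented(1.0, "D")
```

Paired Single operands would need the single precision check applied to each half, since DIV.PS shares the single precision exponent values.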


If rounding precision is not required, this instruction can be implemented using RECIP and MUL instructions.
This saves the rounding step, which takes 8 cycles to execute.


Hence the sequence


RECIP.fmt
MUL.fmt


has 8 fewer cycles than


DIV.fmt
4.9.3 SQRT Operation


SQRT fd, fs; fd <- sqrt(fs)


In SB-1, this operation is implemented using RSQRT.fmt, MUL.fmt, and a rounding step to obtain the correctly
rounded IEEE result.


If rounding precision is not required, this instruction can be implemented using RSQRT and MUL instructions.
This saves the rounding step, which takes 8 cycles to execute.


Hence the sequence


RSQRT.fmt f1, f2
MUL.fmt f1, f1, f2


has 8 fewer cycles than

SQRT.fmt f1, f2


4.9.4 RECIP1 and RSQRT1 Operations 


RECIP1 computes an approximation of 1/x and RSQRT1 computes an approximation of 1/sqrt(x), both with at 
least 14 bits of precision for Single, Double and Paired Single operands. 


For further detail on these operations, refer to MIPS-3D Specifications. 


4.9.5 RECIP2 


RECIP2 computes -(a * b - 1) for any number in Single, Double, and Paired Single format and is implemented
using NMSUB fd, 1, fs, ft operation. 
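
The quantity -(a * b - 1) is the correction term of a Newton-Raphson reciprocal refinement: given an estimate x0 of 1/a, the improved estimate is x0 + x0 * (1 - a * x0). The sketch below illustrates that arithmetic in double precision. It is an illustration only: seed_reciprocal is a hypothetical stand-in for a RECIP1-style estimate that simply truncates 1/a so its relative error stays below 2^-14, and the pairing of the RECIP2 term with a following multiply-add is an assumption about typical use, not a statement of the SB-1 instruction sequence.

```python
import math


def seed_reciprocal(a, bits=14):
    """Stand-in for a low-precision estimate of 1/a (positive a only)."""
    mantissa, exponent = math.frexp(1.0 / a)   # mantissa in [0.5, 1)
    scale = 1 << (bits + 1)
    # Truncating the mantissa keeps the relative error below 2**-bits.
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)


def recip2(fs, ft):
    """The RECIP2 quantity from the text: -(fs * ft - 1)."""
    return -(fs * ft - 1.0)


def refine_reciprocal(a, x0):
    """One Newton-Raphson step: x1 = x0 + x0 * (1 - a * x0)."""
    return x0 + x0 * recip2(a, x0)


def relative_error(approx, a):
    return abs(approx * a - 1.0)


a = 3.0
x0 = seed_reciprocal(a)
x1 = refine_reciprocal(a, x0)
```

Each step roughly squares the relative error, so an estimate good to about 14 bits becomes good to about 28 bits after one refinement.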


4.9.6 RSQRT2 


RSQRT2 computes -(a * b - 1) / 2 for any number in Single, Double, and Paired Single format and is 
implemented using NMSUB fd, 1, fs, ft operation with divide by 2 factored in at the end. 


4.10 Supplemental FP Instructions in SB-1


This section describes the supplemental floating point instructions supported in SB-1. 
Floating Point Divide    DIV.fmt


FIGURE 4-2 DIV Format


Format:
DIV.S fd, fs, ft
DIV.D fd, fs, ft
DIV.PS fd, fs, ft    SB-1 Addition


Purpose: To divide FP values


Description: fd <- fs / ft

The value in FPR fs is divided by the value in FPR ft. The result is calculated to infinite precision, 
rounded according to the current rounding mode in FCSR, and placed into FPR fd. The operands 
and result are values in format fmt. 


Restrictions: 

The fields fs, ft, and fd must specify FPRs valid for operands of type fmt; if they are not valid, the result is
undefined.

The operands must be values in format fmt; if they are not, the result is undefined and the value of the
operand FPRs becomes undefined.

An Unimplemented Operation exception is raised when the exponent of ft is 254 or 253 for DIV.S and DIV.PS, and
when the exponent of ft is 2047 or 2046 for DIV.D.


Operation:
StoreFPR (fd, fmt, ValueFPR(fs, fmt) / ValueFPR(ft, fmt) ) 


Exceptions: Coprocessor Unusable, Reserved Instruction 
Floating Point Exceptions: Inexact, Invalid Operation, Unimplemented Operation,
Division-by-zero, Overflow, Underflow 
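
The divisor exponent values named in the restrictions are biased exponent fields. As a rough illustration, the Python sketch below extracts that field and tests it against the listed values. The function names are hypothetical, and the check simply mirrors the values quoted in the text.

```python
import struct


def biased_exponent_f32(x: float) -> int:
    """Return the 8-bit biased exponent field of x encoded as binary32."""
    raw = struct.unpack("<I", struct.pack("<f", x))[0]
    return (raw >> 23) & 0xFF


def biased_exponent_f64(x: float) -> int:
    """Return the 11-bit biased exponent field of x encoded as binary64."""
    raw = struct.unpack("<Q", struct.pack("<d", x))[0]
    return (raw >> 52) & 0x7FF


def div_s_divisor_unimplemented(ft: float) -> bool:
    """True if ft has one of the exponents listed for DIV.S and DIV.PS."""
    return biased_exponent_f32(ft) in (253, 254)


def div_d_divisor_unimplemented(ft: float) -> bool:
    """True if ft has one of the exponents listed for DIV.D."""
    return biased_exponent_f64(ft) in (2046, 2047)


if __name__ == "__main__":
    print(div_s_divisor_unimplemented(2.0 ** 126), div_d_divisor_unimplemented(1.0))
```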
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Reciprocal Approximation RECIP.fmt 


Bits:   31-26    25-21   20-16   15-11   10-6    5-0
Value:  010001                                   010101
Width:  6        5       5       5       5       6


FIGURE 4-3 RECIP Format 


Format: 
RECIP.S fd, fs 
RECIP.D fd, fs 
RECIP.PS fd, fs SB-1 Addition 


Purpose: To approximate the reciprocal of an FP value (quickly) 


Description: fd ← 1.0 / fs


The reciprocal of the value in FPR fs is approximated and placed into FPR fd. The operand and result 
are values in format fmt. 


The numeric accuracy of this operation does not meet the accuracy specified by the IEEE 754 Floating
Point standard. The computed result differs from both the exact result and the IEEE-mandated
representation of the exact result by no more than one unit in the least-significant place (ULP).


The result is not affected by the current rounding mode in FCSR.


Restrictions:
The fields fs and fd must specify FPRs valid for operands of type fmt; if they are not valid, the result is
undefined.

The operand must be a value in format fmt; if it is not, the result is undefined and the value of the
operand FPR becomes undefined.


Operation: 
StoreFPR(fd, fmt, 1.0 / ValueFPR(fs, fmt) ) 


Exceptions: Coprocessor Unusable, Reserved Instruction 
Floating Point Exceptions: Inexact, Invalid Operation, Unimplemented Operation,
Division-by-zero, Overflow, Underflow 




Reciprocal Square Root Approximation RSQRT.fmt 


Bits:   31-26    25-21   20-16   15-11   10-6    5-0
Value:  010001                                   010110
Width:  6        5       5       5       5       6


FIGURE 4-4 RSQRT Format 


Format: 
RSQRT.S fd, fs
RSQRT.D fd, fs
RSQRT.PS fd, fs SB-1 Addition 


Purpose: To approximate the reciprocal square root of an FP value (quickly) 


Description: fd ← 1.0 / SQRT(fs)


The reciprocal of the positive square root of the value in FPR fs is approximated and placed into FPR 
fd. The operand and result are values in format fmt. 


The numeric accuracy of this operation does not meet the accuracy specified by the IEEE 754 Floating
Point standard. The computed result differs from both the exact result and the IEEE-mandated
representation of the exact result by no more than two units in the least-significant place (ULP).


The result is not affected by the current rounding mode in FCSR.


Restrictions:
The fields fs and fd must specify FPRs valid for operands of type fmt; if they are not valid, the result is
undefined.

The operand must be a value in format fmt; if it is not, the result is undefined and the value of the
operand FPR becomes undefined.


Operation: 
StoreFPR(fd, fmt, 1.0 / SquareRoot(ValueFPR(fs, fmt) )) 


Exceptions: Coprocessor Unusable, Reserved Instruction 
Floating Point Exceptions: Inexact, Invalid Operation, Unimplemented Operation,
Division-by-zero, Overflow, Underflow 
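
The accuracy bounds quoted for RECIP.fmt (one ULP) and RSQRT.fmt (two ULPs) can be checked in software by counting how many representable values separate a result from a reference. The Python sketch below shows one way to do this for single precision. The function names are hypothetical, and NaN inputs are not handled.

```python
import struct


def ordered_f32(x: float) -> int:
    """Map a binary32 value to an integer that increases monotonically with x."""
    raw = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative values have the sign bit set; mirror them below zero.
    return raw if raw < 0x80000000 else 0x80000000 - raw


def ulp_distance_f32(a: float, b: float) -> int:
    """Number of representable binary32 steps between a and b."""
    return abs(ordered_f32(a) - ordered_f32(b))


def within_ulps(result: float, reference: float, limit: int) -> bool:
    """True if result is no more than `limit` ULPs away from reference."""
    return ulp_distance_f32(result, reference) <= limit


if __name__ == "__main__":
    print(ulp_distance_f32(1.0, 1.0 + 2 ** -23))
```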
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Floating Point Square Root SQRT.fmt 


Bits:   31-26    25-21   20-16   15-11   10-6    5-0
Value:  010001                                   000100
Width:  6        5       5       5       5       6


FIGURE 4-5 SQRT Format 


Format: 
SQRT.S fd, fs
SQRT.D fd, fs
SQRT.PS fd, fs SB-1 Addition 


Purpose: To compute the square root of an FP value 
Description: fd ← SQRT(fs)


The square root of the value in FPR fs is calculated to infinite precision, rounded according to the cur- 
rent rounding mode in FCSR, and placed into FPR fd. The operand and result are values in format fmt. 


If the value in FPR fs corresponds to -0, the result is -0.


Restrictions: 
If the value in FPR fs is less than 0, an Invalid Operation condition is raised. 


The fields fs and fd must specify FPRs valid for operands of type fmt; if they are not valid, the result is
undefined.


The operand must be a value in format fmt; if it is not, the result is undefined and the value of the
operand FPR becomes undefined.


Operation:
StoreFPR(fd, fmt, SquareRoot(ValueFPR(fs, fmt)))


Exceptions: Coprocessor Unusable, Reserved Instruction
Floating Point Exceptions: Inexact, Invalid Operation, Unimplemented Operation
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4.11 FIR Register Implementation in SB-1 


The Floating Point Implementation Register (FIR) is a 32-bit read-only register that contains information 
identifying the capabilities of the floating point unit, the floating point processor identification, and the revision 
level of the floating point unit. Figure 4-6 shows the format of the FIR register and Table 4-13 describes the FIR
register fields. 


Bits:   31-20   19   18   17   16   15-8             7-0
Field:          3D   PS   D    S    Implementation   Revision
Value:          1    1    1    1    0x1              0x1


FIGURE 4-6 FIR Register Format in SB-1 


TABLE 4-13 FIR Register Field Descriptions 


Field            Description                                                                        Reset State

PS               Indicates that the paired single (PS) floating point data type and instructions   1
                 are implemented

D                Indicates that the double-precision (D) floating point data type and              1
                 instructions are implemented

Implementation   Identifies the floating point processor. This value matches the corresponding     0x1
                 field of the PRId CP0 register

Revision         Specifies the revision number of the floating point unit. This allows software    0x1
                 to distinguish between one revision and another of the same floating point
                 processor type.
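
As an illustration of how software might interpret a value read from FIR, the Python sketch below splits a 32-bit value into the fields of Figure 4-6. The bit positions are taken from that figure as far as it can be read; the function name and dictionary keys are hypothetical.

```python
def decode_fir(value: int) -> dict:
    """Split a 32-bit FIR value into its fields (bit positions per Figure 4-6)."""
    return {
        "3D": (value >> 19) & 0x1,               # MIPS-3D ASE implemented
        "PS": (value >> 18) & 0x1,               # paired single implemented
        "D": (value >> 17) & 0x1,                # double precision implemented
        "S": (value >> 16) & 0x1,                # single precision implemented
        "Implementation": (value >> 8) & 0xFF,   # floating point processor ID
        "Revision": value & 0xFF,                # floating point unit revision
    }


if __name__ == "__main__":
    print(decode_fir(0x000F0101))
```

The example value 0x000F0101 corresponds to the field values shown in the figure: all four capability bits set, Implementation 0x1, and Revision 0x1.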


4.12 Exception Processing 


This section is currently being worked on and will be fully included in the next revision of this document. 
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4.12.1 RESET 


After Reset, all exceptions are disabled, flush to zero is enabled, and rounding mode is set to RN (Round to 
nearest-even). 


4.12.2 FP Instruction Issue Policy with Exception Off Mode 


If no exception is enabled and flush to zero is enabled, the issue box optimally schedules FP operations into the 
FP unit. 


4.12.3 FP Instruction Issue Policy with Exception On Mode 


If any exception is enabled or flush to zero is disabled, then no further operations will be issued for one cycle. If
the current operation is a long-latency operation (DIV, SQRT, RECIP, RSQRT), then no operation will be issued 
until the long-latency operation is within 3 cycles of completion. 


Medium-latency operations (MADD, MSUB, NMADD, NMSUB, RECIP2, RSQRT2) will hold off the issue of 
any short or long latency operation until the medium latency operation is within 3 cycles of completion. 


4.12.4 Denormals 


The SB-1 will flush all denormals to zero if flush to zero is enabled. It will also flush all underflow results to
zero.


If flush to zero is disabled, the SB-1 will cause an unimplemented operation exception for denormal inputs and 
underflowing results for arithmetic operations. 
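
The effect of flush to zero on a denormal operand can be pictured with the Python sketch below, which replaces any single-precision denormal with zero. Preserving the sign of the flushed value is an assumption made for this illustration, and the function name is hypothetical.

```python
import struct


def flush_to_zero_f32(x: float) -> float:
    """Return x unchanged unless it is a binary32 denormal, in which case return zero."""
    raw = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    if exponent == 0 and fraction != 0:
        raw &= 0x80000000          # assumption: keep only the sign bit
    return struct.unpack("<f", struct.pack("<I", raw))[0]


if __name__ == "__main__":
    print(flush_to_zero_f32(1e-40), flush_to_zero_f32(1.5))
```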
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4.12.5 Exception Flags 


The following table shows the exception flag settings for various categories of floating point operations. 


TABLE 4-14 SB-1 Exception Behavior 


[Table body not recoverable from the source; the header row lists Input, Result, and the exception flag columns.]
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CHAPTER 5 The MDMX ASE Instructions 


5.1 Introduction 


This chapter provides a general overview of MDMX: 
provided here should be regarded as a supplement: 


PABSDIFF    Provides the absolute value of the difference of the elements of a pair of 8 x 8-bit vectors
PABSDIFFC   Provides the same functionality as PABSDIFF on the input vector pairs and accumulates the results


A description of these instructions follows: 
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PAVG.OB Perform Bytewise Averaging 


Bits:   31-26    25-21   20-16   15-11   10-6    5-0
Value:  011110                                   001000
Width:  6        5       5       5       5       6


FIGURE 5-1 PAVGOB Format 


Format: PAVG.OB vd, vs, vt

Purpose: Perform Bytewise Averaging

Description: vd[i] ← (vs[i] + select(fmtsel, vt)[i]) / 2
This instruction only supports OB format.

The sel field selects the values of vt[] used for each i.

Restrictions: 

No data-dependent exceptions are possible. 


The operands must be values in the specified format. If they are not, the results are undefined and the values of 
the operand vectors become undefined. 


The result of this instruction is undefined if the processor is executing in 16 FP register mode. 




Operation:


PAVG.OB
   ts ← CPR[vs]
   tt ← select(fmtsel, vt)

   CPR[vd] ← PAVGOB(ts63..56, tt63..56)
          || PAVGOB(ts55..48, tt55..48)
          || PAVGOB(ts47..40, tt47..40)
          || PAVGOB(ts39..32, tt39..32)
          || PAVGOB(ts31..24, tt31..24)
          || PAVGOB(ts23..16, tt23..16)
          || PAVGOB(ts15..08, tt15..08)
          || PAVGOB(ts07..00, tt07..00)


function PAVGOB(ts, tt)
   PAVGOB ← [(0 || ts) + (0 || tt) + 1] >> 1
end PAVGOB


Exceptions: Co-processor Unusable, Reserved Instruction, MDMX Unusable 
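
The per-byte operation above can be modeled in a few lines of Python. This is a minimal sketch that assumes the select(fmtsel, vt) step has already been applied, so that both arguments are plain 64-bit vectors of eight unsigned bytes. The function name is hypothetical.

```python
def pavg_ob(vs: int, vt: int) -> int:
    """Bytewise rounded average of two 64-bit vectors of eight unsigned bytes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        ts = (vs >> shift) & 0xFF
        tt = (vt >> shift) & 0xFF
        avg = (ts + tt + 1) >> 1       # 9-bit sum plus rounding, then halve
        result |= avg << shift
    return result


if __name__ == "__main__":
    print(hex(pavg_ob(0x0000000000000001, 0x0000000000000002)))
```

The extra +1 before the shift means that averages ending in one half round up, so the average of 1 and 2 is 2.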
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PABSDIFF.OB Perform Bytewise Absolute Value 


FIGURE 5-2 PABSDIFF.OB Format


Format: PABSDIFF.OB vd, vs, vt 
Purpose: Perform Bytewise Absolute Value of Differences 
Description: vd[i] ← (vs[i] > select(fmtsel, vt)[i]) ?
             (vs[i] - select(fmtsel, vt)[i]) : (select(fmtsel, vt)[i] - vs[i])


This instruction only supports OB format.

The sel field selects the values of vt[] used for each i.
Restrictions: 

No data-dependent exceptions are possible. 


The operands must be values in the specified format. If they are not, the results are undefined and the values of 
the operand vectors become undefined. 


The result of this instruction is undefined if the processor is executing in 16 FP register mode.


Operation:
PABSDIFF.OB
   ts ← CPR[vs]
   tt ← select(fmtsel, vt)

   CPR[vd] ← PABSDIFFOB(ts63..56, tt63..56)
          || PABSDIFFOB(ts55..48, tt55..48)
          || PABSDIFFOB(ts47..40, tt47..40)
          || PABSDIFFOB(ts39..32, tt39..32)
          || PABSDIFFOB(ts31..24, tt31..24)
          || PABSDIFFOB(ts23..16, tt23..16)
          || PABSDIFFOB(ts15..08, tt15..08)
          || PABSDIFFOB(ts07..00, tt07..00)


function PABSDIFFOB(ts, tt)
   if ts >= tt
      PABSDIFFOB ← ts - tt
   else
      PABSDIFFOB ← tt - ts
end PABSDIFFOB


Exceptions: Co-processor Unusable, Reserved Instruction, MDMX Unusable
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PABSDIFFC.OB Perform Bytewise Absolute Value 


Bits:   31-26    25-21   20-16   15-11   10-6    5-0
Value:  011110                           00000   110101
Width:  6        5       5       5       5       6


FIGURE 5-3 PABSDIFFC.OB Format


Format: PABSDIFFC.OB vs, vt 
Purpose: Accumulate Absolute Values of Differences of Byte Vectors 
Description: acc[i] ← acc[i] + ((vs[i] > select(fmtsel, vt)[i]) ?
             (vs[i] - select(fmtsel, vt)[i]) : (select(fmtsel, vt)[i] - vs[i]))


This instruction only supports OB format.

The sel field selects the values of vt[] used for each i.
Restrictions: 

No data-dependent exceptions are possible. 


The operands must be values in the specified format. If they are not, the results are undefined and the values of
the operand vectors become undefined.


The result of this instruction is undefined if the processor is executing in 16 FP register mode.


Operation:
PABSDIFFC.OB
   ts ← CPR[vs]
   tt ← select(fmtsel, vt)

   ACC ← PABSDIFFCOB(acc191..168, ts63..56, tt63..56)
      || PABSDIFFCOB(acc167..144, ts55..48, tt55..48)
      || PABSDIFFCOB(acc143..120, ts47..40, tt47..40)
      || PABSDIFFCOB(acc119..096, ts39..32, tt39..32)
      || PABSDIFFCOB(acc095..072, ts31..24, tt31..24)
      || PABSDIFFCOB(acc071..048, ts23..16, tt23..16)
      || PABSDIFFCOB(acc047..024, ts15..08, tt15..08)
      || PABSDIFFCOB(acc023..000, ts07..00, tt07..00)


function PABSDIFFCOB(a, ts, tt)
   PABSDIFFCOB ← a + PABSDIFFOB(ts, tt)
end PABSDIFFCOB


Exceptions: Co-processor Unusable, Reserved Instruction, MDMX Unusable
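
The PABSDIFF.OB and PABSDIFFC.OB operations can be modeled together in Python. The sketch below assumes that select(fmtsel, vt) has already been applied, represents the accumulator as a list of eight lanes, and wraps each lane at 24 bits. The wrap-around behavior and the function names are assumptions made for illustration only.

```python
def pabsdiff_ob(vs: int, vt: int) -> int:
    """Bytewise absolute difference of two 64-bit vectors of eight unsigned bytes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        ts = (vs >> shift) & 0xFF
        tt = (vt >> shift) & 0xFF
        result |= abs(ts - tt) << shift
    return result


def pabsdiffc_ob(acc: list, vs: int, vt: int) -> list:
    """Add the bytewise absolute differences to eight accumulator lanes.

    acc[0] corresponds to byte lane 0 (bits 7..0). Each lane is assumed to be
    24 bits wide and to wrap on overflow.
    """
    diffs = pabsdiff_ob(vs, vt)
    return [
        (acc[lane] + ((diffs >> (8 * lane)) & 0xFF)) & 0xFFFFFF
        for lane in range(8)
    ]


if __name__ == "__main__":
    print(hex(pabsdiff_ob(0x0A05, 0x050A)), pabsdiffc_ob([0] * 8, 0x0A05, 0x050A))
```

Repeated calls to pabsdiffc_ob over successive vector pairs yield per-lane sums of absolute differences, which is the usual reason for pairing an absolute-difference operation with an accumulator.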
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5.3 MDMX ASE Instruction Categories in SB-1 


MDMX ASE instructions fall into one of three categories as implemented in SB-1. These categories are shown
in Table 5-2. 


TABLE 5-2 MDMX Instruction Categories in SB-1


Category                       List of Instructions

TYPE I                         TYPE I-0: No Condition Code (CC) involvement
Non-Accumulator Instructions   ADD, SUB, MUL, AND, OR, NOR, XOR, SLL, SRL, SRA, MSGN, ALNI, ALNV, MIN,
                               MAX, SHFL, PAVG (SB-1 specific), PABSDIFF (SB-1 specific)

                               TYPE I-1: Read CC
                               PICKF, PICKT

                               TYPE I-2: Write CC
                               C.EQ, C.LT, C.LE

TYPE II                        MULS, MULSL, MULL, MULA, SUBA, SUBL, ADDA, ADDL, WACH, WACL,
                               PABSDIFFC (SB-1 specific)

TYPE III                       RZU, RNAU, RNEU, RZS, RNAS, RNES, RACH, RACL, RACM
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5.4 MDMX Unit Block Diagram 


The MDMX unit supports 2 execution pipes: A0 and A1. Figure 5-4 shows a block diagram of the A0 pipe and
Figure 5-5 shows the block diagram for the A1 pipe. The following sections specify the types of instructions that can
be issued to either pipe.


[Block diagram. Inputs: Imm Operand, From FPR(vt), From FPR(vs). Stage shown: MUL2. Output to FP Register File.]


FIGURE 5-4 A0 Pipe Block Diagram


[Block diagram. Inputs: Imm Operand, From FPR(vt), From FPR(vs), From CCR. Output to FP Register File.]


FIGURE 5-5 A1 Pipe Block Diagram
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5.5 Pipeline Flow by Category of Instructions 


The following sections outline the pipeline flow for the three types of instructions supported by the MDMX unit. 


5.5.1 TYPE I Pipe 
Table 5-3 shows MDMX pipe diagram for TYPE-I instructions. 


TABLE 5-3 MDMX TYPE-I Pipe Operation 


Type       Pipeline Stages

TYPE I-1   Fetch | Decode | Issue | Read RF, Read CC | Execute 1 | Execute 2 | Write RF

TYPE I-2   ... | S1 | S2 | Read RF | Execute 1 | Execute 2 | Write CC


TYPE I-0 and I-2 instructions can be issued to either the A0 or A1 pipe of the MDMX Unit, and TYPE I-1 instructions
can be issued to the A1 pipe only.


5.5.2 TYPE II Pipe 
Table 5-4 shows MDMX pipe diagram for TYPE-II instructions. 


TABLE 5-4 MDMX TYPE-II Pipe Operation 


TYPE II    Fetch | Decode | Issue | S1 | S2 | Read RF | Execute 1 | Execute 2 | Write Accumulator


TYPE-II instructions can be issued only to the A1 pipe of the MDMX Unit.


5.5.3 TYPE III Pipe 
Table 5-5 shows MDMX pipe diagram for TYPE-III instructions. 


TABLE 5-5 MDMX TYPE-III Pipe Operation 


TYPE III   Fetch | Decode | Issue | Read RF, Read Accumulator | Execute 1 | Execute 2 | Write RF


TYPE-III instructions can be issued only to the A1 pipe of the MDMX Unit.
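
The pipe assignment rules stated in Sections 5.5.1 through 5.5.3 can be collected into a small lookup, sketched here in Python. The mnemonic lists follow Table 5-2; the dictionary and function names are hypothetical.

```python
# Pipes that accept each instruction category, per Sections 5.5.1 to 5.5.3.
PIPES_BY_TYPE = {
    "I-0": {"A0", "A1"},
    "I-1": {"A1"},
    "I-2": {"A0", "A1"},
    "II": {"A1"},
    "III": {"A1"},
}

# Instruction categories, per Table 5-2.
MNEMONICS_BY_TYPE = {
    "I-0": ["ADD", "SUB", "MUL", "AND", "OR", "NOR", "XOR", "SLL", "SRL", "SRA",
            "MSGN", "ALNI", "ALNV", "MIN", "MAX", "SHFL", "PAVG", "PABSDIFF"],
    "I-1": ["PICKF", "PICKT"],
    "I-2": ["C.EQ", "C.LT", "C.LE"],
    "II": ["MULS", "MULSL", "MULL", "MULA", "SUBA", "SUBL", "ADDA", "ADDL",
           "WACH", "WACL", "PABSDIFFC"],
    "III": ["RZU", "RNAU", "RNEU", "RZS", "RNAS", "RNES", "RACH", "RACL", "RACM"],
}

TYPE_BY_MNEMONIC = {
    mnemonic: category
    for category, mnemonics in MNEMONICS_BY_TYPE.items()
    for mnemonic in mnemonics
}


def allowed_pipes(mnemonic: str) -> set:
    """Return the set of MDMX pipes that can accept the given instruction."""
    return PIPES_BY_TYPE[TYPE_BY_MNEMONIC[mnemonic]]


if __name__ == "__main__":
    print(sorted(allowed_pipes("PAVG")), sorted(allowed_pipes("PICKF")))
```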
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5.6 Instruction Latency and Throughput by Category of Instructions 


Table 5-6 presents a list of supported instructions in SB-1 Floating Point Unit and the associated latency. 


TABLE 5-6 MDMX Instruction Latency and Throughput


[Table body not recoverable from the source. Surviving column headings: Instruction, Latency, Throughput, Dependent Op.]


5.7 Available Bypasses 


Table 5-7 shows the available bypasses for the MDMX unit. 


TABLE 5-7 List of Available Bypasses in SB-1 Core for EX0, EX1, LS0, and LS1 Units


5.8 Differences between the Pipes 


Table 5-8 summarizes the types of instructions that can be issued to each of the A0 and A1 MDMX pipes.
Except for TYPE-I Category instructions, all other instructions can be issued to the A1 pipe only.


TABLE 5-8 Instruction Types Issued to each Pipe
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5.9 Issue Rules and Restrictions 


Table 5-9 identifies issue rules and restrictions for MDMX ASE instructions. 


TABLE 5-9 Issue Rules and Restrictions for MDMX Instructions 


Instruction A 
Any TYPE I, II, or III | Any Dependent TYPE I, II, or III | 1 cycle bubble
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CHAPTER 6 Memory Hierarchy and the 
Primary Instruction and Data 
Caches 


6.1 Introduction 


This chapter elaborates on the supported memory hi 
caches in SB-1. For information on level two cack 
level user’s manual. 


uction and data caches, the memory controller, and the bus interface 


Figure 6-1 shows the organiz 
: ths and their speeds relative to the core are additionally shown on the 


unit around the S ore.: Bus ¥ 
diagram.
128 @ 1x (4 instructions) I-Cache


64 @ 2x (load data)
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FIGURE 6-1 Memory Structures and Bus Organization around the SB-1 Core 
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6.2.1 Level One (Primary) Caches 


The SB-1 core implements built-in level one instruction and data caches with flexible streaming features. The 
next two sections elaborate on the specifics of the primary caches. 


6.2.1.1 Instruction Cache (I-Cache) 
Table 6-1 outlines the main features of the primary instruction cache in SB-1. 


TABLE 6-1 SB-1 Primary Instruction Cache Characteristics 


Specifics 


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Line Size 


Virtually Indexed (44-bit address), Virtually Tagged* 


Critical Quad Word First (Half Line Resolution) 
Not Applicable 

Not Applicable 

1 bit/Byte 

Tag Parity 2 bits/Tag 


ECC Support


a. Includes ASID/G bit to avoid flushing for every context switch
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6.2.1.1.1 Accessing the Instruction Cache 


Figure 6-2 shows the manner in which the 44-bit virtual address is used to access a line. As shown in the figure, 
the index consists of 8 bits, resulting in 256 individually accessible sets of 32-byte lines by 4 ways. The address 
portion of the tag consists of 31 bits. 
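
The figures quoted above (8 index bits, 32-byte lines, 4 ways, 31 tag address bits) are mutually consistent, as the Python sketch below shows. It assumes the conventional layout in which the line offset occupies the low-order address bits, the index the next 8 bits, and the tag the remaining upper bits. The constant and function names are hypothetical.

```python
VA_BITS = 44        # virtual address width used to access the cache
LINE_BYTES = 32     # bytes per cache line
SETS = 256          # individually accessible sets
WAYS = 4            # associativity

OFFSET_BITS = LINE_BYTES.bit_length() - 1             # 5
INDEX_BITS = SETS.bit_length() - 1                    # 8
TAG_BITS = VA_BITS - INDEX_BITS - OFFSET_BITS         # 31
CACHE_BYTES = SETS * WAYS * LINE_BYTES                # 32768 (32 KB)


def split_icache_address(va: int) -> tuple:
    """Split a 44-bit virtual address into (tag, index, offset) under the assumed layout."""
    offset = va & (LINE_BYTES - 1)
    index = (va >> OFFSET_BITS) & (SETS - 1)
    tag = (va >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


if __name__ == "__main__":
    print(TAG_BITS, CACHE_BYTES, split_icache_address(0x2020))
```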


[Figure: the 44-bit virtual address (most significant bit 43) supplies the Upper Address and Index fields; each of the four ways reports Hit/Miss.]


FIGURE 6-2 Primary Instruction Cache Indexing in SB-1 
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6.2.1.1.2 Address Fields Decoding 


Figure 6-3 shows bank organization in the primary instruction cache for SB-1. 


Index 255 


Index 0 


32 bytes + tag 
(1 cache line) 


FIGURE 6-3 Instruction Cache Organization in SB-1 


6.2.1.1.3 Parity/ECC Support 


The primary instruction cache in SB-1 supports data and tag parities (shown in Table 6-1) but does not have 
ECC. 


6.2.1.1.4 Notes on the Virtual Nature of the Instruction Cache 


The following should be considered when dealing with the instruction cache: 


1. Virtual aliases may cause multiple copies of the same cache data to appear in the instruction cache. 


2. If a mapped address is changed from a cached attribute to an uncached attribute, the cache lines must be
flushed from the instruction cache to eliminate the stale instructions. For correct operation, mapped
addresses to an uncached space must never be present in the instruction cache.


Specifically, cacheable and uncacheable references to the same space do not preserve coherence. Note that the
L1 I-cache does not participate in the coherence algorithms.


3. Because of (2) above, the I-Cache must be flushed before seeing a write into the code stream (e.g., planting a 
breakpoint). 
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4. Mapped addresses to uncached space will cause an instruction cache lookup and subsequent error detection to 
be performed. As a result, it is possible to detect an instruction cache error even though the page mapping for 
that address is uncached, causing what may be referred to as phantom CacheError exceptions.


5. The D-Cache will not supply data to satisfy an I-Cache miss for the same CPU.


6. The ASID field in the EntryHi register should only be modified by a DMTC0 or TLBR instruction in
unmapped or in mapped global space, i.e. the G bit is set in the TLB entry. If the ASID is changed in mapped
space that is not global, i.e. the G bit is cleared, the behavior of the processor is UNDEFINED, and TLB 
exceptions, including TLB Shutdown, may result. 


6.2.1.2 Data Cache (D-Cache) 


Table 6-2 outlines the main features of the Primary Data cache in SB-1. 


TABLE 6-2 SB-1 Primary Data Cache Characteristics 
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6.2.1.2.1 Accessing the Data Cache 


Figure 6-4 shows the manner in which the 44-bit virtual address is used to access a line. As shown in the figure,
the index consists of 8 bits, resulting in 256 individually accessible 32-byte lines by 4 ways. The address portion
of the tag consists of 28 bits (see note 1). The next section elaborates on the full composition of bits in the tag field.


FIGURE 6-4 Primary Data Cache Indexing in SB-1 


1. Bit 12 (shown in Figure 6-4) is part of the tag and part of the index. A program making explicit
reference to tags (via TagLo register) must be aware of this and maintain consistency between index
and tags at that index.
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6.2.1.2.2 Address Fields Decoding 


Figure 6-5 shows bank organization in the primary data cache for SB-1. 


Index 255 


+ 


32 bytes + tag 
(1 cache line) 


FIGURE 6-5 Data Cache Organization in SB-1 


6.2.1.2.3 Parity/ECC Support


The Primary Data Cache in SB-1 supports tag parity (shown in Table 6-1) and has 64-bit ECC for the data 
portion. Single-bit errors are corrected and double-bit errors are detected. Refer to Chapter 10 for further 
description of these error cases. 


6.2.2 Rules for Uncached Data Accesses 
The following rules and restrictions apply to uncached accesses:


1. Uncached accesses are issued in order.

2. Uncached accesses are never issued speculatively.

3. Uncached writes are blocked if there are any outstanding uncached reads.

4. Uncached accesses may be issued in any number, subject to normal ReadQ/Write buffer depths.

5. External system must maintain ordering:

- Reads to a device must not pass reads or writes to the same device.
- Writes to a device must not pass writes to the same device.


6.2.3 Operation of the Write Buffer 


The write buffer is a 10-deep storage structure that holds data on its way to memory. Each entry consists of a PA 
and 32 bytes of data. There are three logical sources for the data being put into the write buffer, outlined below: 


1. Lines evicted from the Data Cache (as a result of a fill or a CACHE instruction). The data is always a full
cache line (32 bytes).


2. Uncacheable stores. These data come from the DCFIFO where they are held until store instructions graduate.
The data in this case is between 1 and 8 bytes wide.


3. Uncacheable accelerated stores. These are the same as uncacheable stores but can be merged in the write
buffer, provided they obey the merging rules outlined next.


4. Uncacheable loads (in order to maintain order with uncacheable stores).


6.2.3.1 Merging Rules 


The write buffer contains one 32-byte merge buffer. 


The merge buffer begins merging when an uncached accelerated (UAC) double or single word block-aligned
store is executed. Merging continues if the next uncacheable write buffer request is a UAC double or single
word store to an address within the same block. There are two merging modes. If the next request is to an
identical address, then the merging mode is auto-increment; otherwise, the merging mode is sequential. The
merging mode is established by the second UAC store to the block.


Merging stops when one of the following conditions is met: 


An uncached or UAC load is executed. 
An uncached store is executed. 
A UAC partial-word store is executed. 


A change in the current merging mode is observed. 


A complete block is gathered (see note 1).


The time-out counter indicates that 512 cycles have passed since the last UAC store was observed and no
other write buffer request has happened.


1. In sequential mode, the block is considered complete when the 2 highest-addressed words of the block are
written.


Cached accesses to the write buffer do not disturb merging. When merging terminates, the data is placed into a
write buffer entry and is ready to be issued to the system interface bus.


When gathering in auto-increment mode, UAC doubleword or singleword stores may be freely mixed. The data will
be appended to the end of the already merged data in the merge buffer. However, if the merge buffer already
contains seven valid words and the next request is a UAC double store, the doubleword will not fit into the same
32-byte block. In this case, the seven words in the merge buffer are placed into a write buffer entry and the new
double store starts a new merging block.


When gathering in sequential mode, UAC singleword stores must occur in pairs to prevent address error
exceptions.
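
The auto-increment gathering behavior, including the seven-word case, can be illustrated with a very small Python model. It tracks only the number of valid words in the merge buffer and ignores addresses, sequential mode, and the time-out counter. The class and attribute names are hypothetical, and the model is a sketch rather than a description of the hardware.

```python
WORDS_PER_BLOCK = 8   # one 32-byte block holds eight 4-byte words


class AutoIncrementMerge:
    """Toy model of the merge buffer while gathering in auto-increment mode."""

    def __init__(self) -> None:
        self.words = 0       # valid words currently held in the merge buffer
        self.flushed = []    # sizes (in words) of entries passed to the write buffer

    def store(self, size_words: int) -> None:
        """Accept a UAC singleword (1) or doubleword (2) store."""
        if size_words not in (1, 2):
            raise ValueError("only singleword or doubleword UAC stores are modeled")
        if self.words + size_words > WORDS_PER_BLOCK:
            self._flush()                 # the new store does not fit in this block
        self.words += size_words          # append to the end of the merged data
        if self.words == WORDS_PER_BLOCK:
            self._flush()                 # a complete block has been gathered

    def _flush(self) -> None:
        self.flushed.append(self.words)
        self.words = 0


if __name__ == "__main__":
    buf = AutoIncrementMerge()
    for size in [1, 1, 1, 1, 1, 1, 1, 2]:
        buf.store(size)
    print(buf.flushed, buf.words)
```

In the usage shown, seven singleword stores followed by a doubleword store produce one seven-word write buffer entry, and the doubleword begins a new merging block.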


6.2.4 Prefetch Support for Primary Data Cache (User Level Prefetching and Streaming) 


The primary data cache in SB-1 supports a number of the Prefetch Hints as specified in MIPS64 Specification. 
Among the supported hints are regular data prefetching and streaming data through the data cache. 


Table 6-3 presents the high level features provided by PREF (PREFX) instruction. The subsequent two sections 
describe regular prefetching and streaming through the PREF (PREFX) instruction. 


TABLE 6-3 Cache Prefetch Support for Primary Data Cache 


Instruction Type PREF (or PREFX) Instruction, a Non-Privileged Instruction (refer to MIPS64 
Specifications). 


Description of Instruction PREF adds the 16-bit signed offset to the contents of GPR base to form an effective byte 
address. PREFX adds the contents of GPR index to the contents of GPR base to form an 
effective byte address. The hint field supplies information about how the addressed data is 
to be manipulated. 


PREF (PREFX) enables the processor to take some action as specified by the hint field, to 
improve program performance. The action taken for a specific PREF (PREFX) instruction 
is both system and context dependent. Any action, including doing nothing, is permitted as
long as it does not change architecturally visible state or alter the meaning of a program. A
PREF (PREFX) instruction either does nothing or takes an action that increases the 
performance of the program. 


PREF (PREFX) does not cause addressing-related exceptions. If it does happen to raise an 
exception condition, the exception condition is ignored. If an addressing-related exception 
condition is raised and ignored, no data movement occurs. 


PREF (PREFX) never generates a memory operation for a location with an uncached 
memory access type. 


For a cached location, the expected and useful action for the processor is to move a block of 
data between cache and the memory hierarchy. The size of the block transferred in SB-1 is 
one line of data (32 Bytes).


Granularity 1 Line (32 Bytes) 


Programming Notes Prefetch cannot access a mapped location unless the translation for that location is present 
in the TLB. Locations in memory pages that have not been accessed recently may not have 
translations in the TLB, so prefetch may not be effective for such locations. 


Prefetch does not cause addressing exceptions. It does not cause an exception to prefetch 
using a pointer before the validity of the pointer is determined. 


Hint field encodings whose function is described as “streamed” convey usage intent from
software to hardware. Software should not assume that hardware will always prefetch data 
in an optimal way. 


6.2.4.1 Regular Data Prefetching 
Table 6-4 describes the regular data prefetch support provided by SB-1 core. 


TABLE 6-4 Regular Data Prefetch Support Provided by SB-1 


Feature 


Load Hint (Hint Value = 0) Use: Prefetched data is expected to be read (not modified) 
Action: Fetch data as if for a load 

Store Hint (Hint Value = 1) Use: Prefetched data is expected to be stored or modified 
Action: Fetch data as if for a store 


Always put in current LRU, upgrading way to MRU 


6.2.4.2 Streaming Prefetch Support in SB-1


Table 6-5 outlines the streaming prefetch support provided by the primary data cache. 


TABLE 6-5 Streaming Prefetch Support in SB-1 


Description 


Instruction Type PREF or PREFX Instruction, Non-Privileged (refer to MIPS64 Specifications) 


Load_Streamed Hint (Hint Value = 4) Use: Prefetched data is expected to be read (not modified) but not reused extensively; it 
“streams” through the cache. 


Store_Streamed Hint (Hint Value = 5) Use: Prefetched data is expected to be stored or modified but not reused extensively; it 
“streams” through the cache.


Data Placement If the block is already in the cache, treat as regular prefetch (can upgrade to MRU). 
Otherwise, replace the LRU way without upgrading that way, and mark the way so that 
subsequent hits do not upgrade. The next fill to that index resets the way settings. 
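
The difference between regular and streamed placement can be illustrated with a toy model of one 4-way set, written in Python. The model is a sketch under stated assumptions: replacement is true LRU, and a fill clears the no-upgrade mark only on the way being filled. The class, method, and attribute names are hypothetical.

```python
class CacheSet:
    """Toy model of one 4-way set with regular and streamed fills."""

    def __init__(self, ways: int = 4) -> None:
        self.tags = [None] * ways
        self.lru_order = list(range(ways))    # index 0 is LRU, last is MRU
        self.no_upgrade = [False] * ways      # ways filled by a streamed prefetch

    def _upgrade_to_mru(self, way: int) -> None:
        self.lru_order.remove(way)
        self.lru_order.append(way)

    def access(self, tag, streamed: bool = False) -> int:
        """Look up `tag`, filling on a miss; return the way used."""
        if tag in self.tags:
            way = self.tags.index(tag)
            if not self.no_upgrade[way]:      # marked ways are not upgraded on a hit
                self._upgrade_to_mru(way)
            return way
        way = self.lru_order[0]               # a fill always replaces the LRU way
        self.tags[way] = tag
        self.no_upgrade[way] = streamed       # the fill resets the mark for this way
        if not streamed:
            self._upgrade_to_mru(way)         # regular fill becomes MRU
        return way


if __name__ == "__main__":
    s = CacheSet()
    for t in ["A", "B", "C", "D"]:
        s.access(t)
    s.access("S1", streamed=True)
    s.access("S2", streamed=True)
    print(s.tags)
```

In this model a sequence of streamed fills keeps reusing the same way, so the other three lines in the set are not displaced.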
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6.2.5 The PREF and PREFX Instructions in SB-1! 
This section presents a general description of PREF and PREFX instructions in SB-1. 


PREF
110011


FIGURE 6-6 Format for PREF Instruction


COP1X    Base   Index   Hint   0
010011                               001111


FIGURE 6-7 Format for PREFX Instruction


Format: 
PREF hint, offset(base) 


PREFX hint, index(base)


Purpose: 
To move data between memory and cache. 


1. The PREFX instruction is identical to PREF but supports base + index addressing instead. Except 
for address computation, all descriptions for PREF apply equally to PREFX. 
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Description: 
PREF adds the 16-bit signed offset to the contents of GPR base to form an effective byte address. 


PREFX adds the contents of GPR index to the contents of GPR base to form an effective byte address. 

The hint field supplies information about how the addressed data is to be manipulated. 

PREF enables the processor to take some action as specified by the hint field, to improve program performance.
The action taken for a specific PREF instruction is both system and context dependent. Any action, including
doing nothing, is permitted as long as it does not change architecturally visible state or alter the meaning of a
program (refer to Table 6-6 for more details).


PREF does not cause addressing-related exceptions. If it does happen to raise an exception condition, the
exception condition is ignored. If an addressing-related exception condition is raised and ignored, no data
movement occurs.


PREF never generates a memory operation for a location with an uncached memory access type. For a cached 
location, the expected and useful action for the processor is to move a block of data between cache and the 
memory hierarchy. The size of the block transferred in SB-1 is one line (32 bytes). 
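
The two effective address computations described above can be sketched in Python as follows. The sketch assumes 64-bit wrap-around arithmetic on the result, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pref_address(base: int, offset: int) -> int:
    """PREF: GPR base plus the sign-extended 16-bit offset."""
    return (base + sign_extend16(offset)) & MASK64


def prefx_address(base: int, index: int) -> int:
    """PREFX: GPR base plus GPR index."""
    return (base + index) & MASK64


def line_address(addr: int) -> int:
    """Address of the 32-byte line containing addr (the block size moved by PREF)."""
    return addr & ~0x1F & MASK64


if __name__ == "__main__":
    print(hex(pref_address(0x1000, 0xFFF0)), hex(prefx_address(0x1000, 0x20)))
```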


Table 6-6 defines the hint field values. 


TABLE 6-6 PREF Hint Field Encodings

Value | Name | Data Use and Desired PREF Action | SB-1 Reference
0 | load | Use: Prefetched data is expected to be read (not modified). Action: Fetch data as if for a load. | Supported in SB-1. Section 6.2.4.1: “Regular Data Prefetching”
1 | store | Use: Prefetched data is expected to be stored or modified. Action: Fetch data as if for a store. | Supported in SB-1. Section 6.2.4.1: “Regular Data Prefetching”
2-3 | Reserved | Reserved for future use - not available to implementations. |
4 | load_streamed | Use: Prefetched data is expected to be read (not modified) but not reused extensively; it “streams” through the cache. Action: Fetch data as if for a load and place it in the cache so that it does not displace data prefetched as “retained”. | Supported in SB-1. Section 6.2.4.2: “Streaming Prefetch Support in SB-1”
5 | store_streamed | Use: Prefetched data is expected to be stored or modified but not reused extensively; it “streams” through the cache. Action: Fetch data as if for a store and place it in the cache so that it does not displace data prefetched as “retained”. | Supported in SB-1. Section 6.2.4.2: “Streaming Prefetch Support in SB-1”
6 | load_retained | Not Applicable | Not Supported in SB-1
7 | store_retained | Not Applicable | Not Supported in SB-1
8-24 | Reserved | Reserved for future use - not available to implementations. |
25 | writeback_invalidate (also known as nudge) | Use: Data is no longer expected to be used. Action: Schedule a writeback of any dirty data. At the completion of the writeback, mark as invalid the state of any cache line written back. | Supported in SB-1. See the explanation at left.
26-31 | Implementation Dependent | Unassigned by the Architecture. |
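
The hint encodings can be captured in a small C enumeration. This is a sketch only: the numeric values are assumed to follow the MIPS64 PREF hint field definition, the identifier names are hypothetical, and `sb1_pref_hint_supported` returns true only for the hints that Table 6-6 explicitly marks as supported in SB-1. Reserved, retained, and implementation-dependent values are treated as unsupported here.

```c
#include <stdbool.h>

/* Hint values assumed from the MIPS64 PREF hint field; names are illustrative. */
enum pref_hint {
    PREF_LOAD                 = 0,
    PREF_STORE                = 1,
    PREF_LOAD_STREAMED        = 4,
    PREF_STORE_STREAMED       = 5,
    PREF_LOAD_RETAINED        = 6,  /* listed as not supported in SB-1 */
    PREF_STORE_RETAINED       = 7,  /* listed as not supported in SB-1 */
    PREF_WRITEBACK_INVALIDATE = 25  /* also known as nudge */
};

/* True only for hints that Table 6-6 marks as supported in SB-1. */
bool sb1_pref_hint_supported(unsigned hint)
{
    switch (hint) {
    case PREF_LOAD:
    case PREF_STORE:
    case PREF_LOAD_STREAMED:
    case PREF_STORE_STREAMED:
    case PREF_WRITEBACK_INVALIDATE:
        return true;
    default:
        return false;
    }
}
```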

Restrictions:
None


Operation:
vAddr <- GPR[base] + sign_extend(offset) [1]
(pAddr, CCA) <- AddressTranslation(vAddr, DATA, LOAD) [2]
Prefetch(CCA, pAddr, vAddr, DATA, hint)
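
The address computation in the Operation block can be modeled in C as follows. This is an illustrative sketch, not processor pseudocode: `pref_vaddr` and `prefx_vaddr` are hypothetical names, and register contents are modeled as 64-bit unsigned values.

```c
#include <stdint.h>

/* PREF: effective address = GPR[base] + sign_extend(16-bit offset). */
uint64_t pref_vaddr(uint64_t base, uint16_t offset)
{
    /* Sign-extend the 16-bit immediate to 64 bits without relying on
       implementation-defined narrowing conversions. */
    int64_t simm = (offset & 0x8000u) ? (int64_t)offset - 0x10000
                                      : (int64_t)offset;
    return base + (uint64_t)simm; /* wraps modulo 2^64, like the hardware adder */
}

/* PREFX: effective address = GPR[base] + GPR[index]. */
uint64_t prefx_vaddr(uint64_t base, uint64_t index)
{
    return base + index;
}
```

For example, an offset field of 0xFFFC encodes -4, so `pref_vaddr(0x1000, 0xFFFC)` yields 0xFFC.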


Exceptions: 
Prefetch does not take any TLB-related or address-related exceptions under any circumstances. 


Programming Notes: 

Prefetch cannot access a mapped location unless the translation for that location is present in the TLB. Locations
in memory pages that have not been accessed recently may not have translations in the TLB, so prefetch may not
be effective for such locations.


Prefetch does not cause addressing exceptions. Consequently, prefetching through a pointer before the validity
of that pointer has been determined does not cause an exception.


Hint field encodings whose function is described as “streamed” convey usage intent from software to hardware. 


Software should not assume that hardware will always prefetch data in an optimal way. 


Implementation Notes: 
The SB-1 does not trigger a data watch by a prefetch instruction whose address matches the Watch register 
address match conditions. 


1. For PREFX, the address computation is: vAddr <- GPR[base] + GPR[index] 
2. AddressTranslation, as used here, cannot raise any exceptions. 


6.3 CACHE Instructions


This section covers the microarchitectural implementation of the MIPS CACHE instructions. For reference, the 
table of CACHE instructions is repeated here with a brief description of each operation. In addition, the registers 
used with the CACHE instructions, i.e., TagHi, TagLo, DataHi, and DataLo are defined. 


6.3.1 CACHE Variants 


The following tables, divided by cache type, list the types of cache operations defined by MIPS and implemented 
by SB-1. Special debug cache operations are described as well. 


TABLE 6-7 Instruction Cache

Operation | Bits [20:16] of Cache Inst (a) | Description
Index Inval | 000I | Invalidate the cache line at the specified index
Index Load Tag | 001I | Read the cache line tag at the specified index into the TagHi/TagLo registers
Index Store Tag | 010I | Write the cache line tag at the specified index from the TagHi/TagLo registers
Hit Inval | 100I | Invalidate the cache line at the specified address if it is present in the cache
Index Load Data | | Read the contents of the data array into DataHi/DataLo registers (debug only)
Index Str Data | 010T | Write the contents of the DataHi/DataLo registers into the data array (debug only)

a. I=00, D=01, T=10, S=11


TABLE 6-8 Data Cache

Operation | Bits [20:16] of Cache Inst (a) | Description
Index Invalidate | 000D | Writeback dirty data and invalidate the cache line at the specified index
Index Load Tag | 001D | Read the cache line tag at the specified index into the TagHi/TagLo registers
Index Store Tag | 010D | Write the cache line tag at the specified index from the TagHi/TagLo registers
Hit Inval | 100D | Invalidate the cache line at the specified address if it is present in the cache
Hit WB Inval | 101D | Writeback dirty data and invalidate the cache line at the specified address if it is present in the cache
Hit WB | 110D | Writeback dirty data and set the cache line state to clean at the specified address if the cache line is present in the cache
Index Load Data | 001S | Read the contents of the data array into DataHi/DataLo registers (debug only)
Index Str Data | | Write the contents of the DataHi/DataLo registers into the data array (debug only)

a. I=00, D=01, T=10, S=11
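
The 5-bit operation field combines a 3-bit operation code with the 2-bit cache selector from the table footnote. The C sketch below shows one way to compose that field and a full instruction word. All identifier names are hypothetical. The operation codes listed are limited to those that follow the MIPS CACHE definition, and the instruction layout (major opcode 0b101111 in bits [31:26], base in [25:21], operation in [20:16], offset in [15:0]) is assumed from the MIPS CACHE instruction encoding rather than stated in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Cache selector, bits [17:16], from the table footnote: I=00, D=01, T=10, S=11. */
enum cache_sel { CACHE_I = 0, CACHE_D = 1, CACHE_T = 2, CACHE_S = 3 };

/* Operation code, bits [20:18]. */
enum cache_op3 {
    OP_INDEX_INVALIDATE  = 0, /* 000 */
    OP_INDEX_LOAD_TAG    = 1, /* 001 */
    OP_INDEX_STORE_TAG   = 2, /* 010 */
    OP_HIT_INVALIDATE    = 4, /* 100, per the MIPS CACHE definition */
    OP_HIT_WB_INVALIDATE = 5, /* 101 */
    OP_HIT_WRITEBACK     = 6  /* 110 */
};

/* Compose the 5-bit field placed in instruction bits [20:16]. */
unsigned cache_op_field(unsigned op3, enum cache_sel sel)
{
    assert(op3 < 8);
    return (op3 << 2) | (unsigned)sel;
}

/* Assemble a CACHE instruction word (layout assumed, see lead-in). */
uint32_t cache_insn(unsigned base_reg, unsigned op5, uint16_t offset)
{
    assert(base_reg < 32 && op5 < 32);
    return (0x2Fu << 26) | ((uint32_t)base_reg << 21) |
           ((uint32_t)op5 << 16) | offset;
}
```

For instance, Hit Writeback Invalidate on the data cache (101D) yields the field value 0b10101.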


The effective address for a CACHE instruction is calculated by adding the instruction offset field to a base 
register. The resulting address is translated by the TLB, and depending on the target cache, either the effective 
address or the translated address is used to access the cache. The process of translation may cause a TLB Refill 
or TLB Invalid exception but not a TLB Modified exception. In addition, the effective address for a CACHE 
instruction never generates a Watch exception, although address errors will be asserted for addresses that are not 
legal in the current operating mode. 


For the instruction cache, the effective address is used to access the cache. The address is translated regardless of
the operation, but all TLB errors are suppressed, even though address errors may still result. Index operations 
use bits [12:5] to specify a set index and bits [14:13] to specify a way for the 32K cache on the SB-1 core. For hit 
operations, the ASID in the EntryHi register is coupled with the effective address to generate a virtual address, 
which is compared against the cache tags to detect a hit or miss. 


For the physically addressed data cache, the effective address is translated through the TLB for hit variants. TLB 
exceptions, due to the translation process, as well as address error exceptions may occur for these CACHE 
instructions. The resulting physical address is used to determine a hit or miss in the cache. Index operations,
however, bypass the TLB, so no TLB exceptions will occur. Address errors may arise if the effective address is 
not valid for the current operating mode. Like the instruction cache, bits [12:5] indicate a set index, while bits 
[14:13] designate a way for index operations. 


In the SB-1 implementation, CACHE instructions ignore byte alignment. As such, address errors due to 
misalignment will never occur. 


The Index Load/Store Data operations, defined for debug purposes, require additional bits to specify the double
word location in a cache line. Bits [4:3] of the address are used for this purpose, so the operation effectively
behaves like a double word load/store.
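
The bit assignments described above (set index in bits [12:5], way in bits [14:13], double word in bits [4:3]) can be expressed as a few C helpers. This is a sketch with hypothetical function names. It produces only the low-order index bits; real code would still need to add a base that is legal in the current operating mode, which is outside the scope of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction for index-type CACHE operations on the 32K caches. */
unsigned cache_set_index(uint64_t addr) { return (unsigned)((addr >> 5) & 0xFFu); }
unsigned cache_way(uint64_t addr)       { return (unsigned)((addr >> 13) & 0x3u); }
unsigned cache_dword(uint64_t addr)     { return (unsigned)((addr >> 3) & 0x3u); }

/* Compose the low address bits that select a way, set, and double word. */
uint64_t cache_index_addr(unsigned way, unsigned set, unsigned dword)
{
    assert(way < 4 && set < 256 && dword < 4);
    return ((uint64_t)way << 13) | ((uint64_t)set << 5) | ((uint64_t)dword << 3);
}
```

With 256 sets, 4 ways, and 32-byte lines, these fields cover exactly 32 KB, which matches the cache size given in the text.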


The following sections describe the cache operations supported by the SB-1 core. 


6.3.2 Index Invalidate (I)


The Index Invalidate variant sets the state of an instruction cache line at the specified index to invalid by clearing 
the valid and parity bits (VP = 00). The index is taken from the effective address bits [12:5] and the way is 
selected by bits [14:13]. The LRU remains unchanged and no parity check is performed. Address errors may 
occur for invalid addresses, but no TLB exceptions are raised. 


6.3.3 Index Load Tag (I) 


The Index Load Tag operation loads the instruction cache TagHi and TagLo registers with the information stored
in the I-Cache tag array. The tag index and way are taken from address bits [12:5] and [14:13], respectively. See 
the TagHi and TagLo definitions below for the format and data transferred by these registers. The LRU remains 
unchanged and no parity check is performed. Address errors may occur for invalid addresses, but no TLB 
exceptions are raised. The LU bit is set to one when an Index Load Tag is performed. 


6.3.4 Index Store Tag (I) 


The Index Store Tag operation reads the instruction cache TagHi and TagLo registers and stores the information 
into the cache tag array. The tag index and way are taken from address bits [12:5] and [14:13], respectively. See 
the TagHi and TagLo definitions in section 6.6 for the format and data transferred by these registers. Address
errors may occur for invalid computed addresses, but no TLB exceptions are raised. If the LU bit is set in the
TagHi register, then the LRU is set to the state indicated by the LRU field; otherwise, it is set to a default state.
Invalid LRU values will also be reset to a default state. See the LRU implementation notes below. A parity
calculation is not performed by this operation, so the parity bits for the tag are taken directly from the P1, P0, and
P bits in the Tag registers.


6.3.5 Hit Invalidate (I)


The Hit Invalidate operation clears the state of an instruction cache line if the effective address and the ASID 
match a tag in the cache. A hit sets the valid and parity bits to zero (VP = 00). The LRU remains unchanged, and 
any detected parity errors are ignored. Address errors may occur for invalid addresses, but no TLB exceptions 
are raised. 


6.3.6 Index Load Data (I)


The Index Load Data operation is implemented purely for debug purposes and loads the instructions from the 
data array into the DataHi and DataLo registers. Two instructions are loaded into the DataLo register, while the 
parity and predecode for those instructions are written into the DataHi register.


This operation is endian-neutral, so software must interpret the instructions correctly. For big-endian code, InstA 
contains instruction word 0 and InstB contains instruction word 1. The opposite is true for little-endian code, so
InstA contains word 1 and InstB contains word 0. This format allows software to move the double word data
directly to a register and perform double word stores to code space without swapping. The LRU remains 
unchanged for Index Load Data CACHE instructions, and a parity check is not performed. Address errors may 
occur for invalid addresses, but no TLB exceptions are raised. 


6.3.7 Index Store Data (I) 


The Index Store Data operation is implemented purely for debug purposes and stores the instructions contained 
in the DataHi and DataLo registers into the instruction cache data array. The DataLo register contains two 
instructions, while the DataHi register includes the parity and predecode for those instructions. The predecode 
logic is bypassed for this operation, so the predecode bits for each instruction are taken from the register. 


This operation is endian-neutral, so software must write instructions into the InstA and InstB fields in the DataLo 
register for the appropriate endianness. For big-endian code, InstA contains instruction word 0 and InstB 
contains instruction word 1. The opposite is true for little-endian code, so InstA contains word 1 and InstB 
contains word 0. This format allows software to perform doubleword loads to code space and to move the 
doubleword data directly to DataLo without swapping. 


The LRU remains unchanged for Index Store Data CACHE instructions. A parity calculation is not performed 
by this operation, so the parity bits for the data are taken directly from the IPA and IPB fields. The predecode 
bits and predecode parity must also be calculated by software and placed in the DataHi register. Address errors 
may occur for invalid addresses, but no TLB exceptions are raised. 


6.3.8 Index Invalidate (D) 


The Index Invalidate variant sets the state of a data cache line at the specified index to invalid by clearing the 
state, coherent, and check bits to all zeros. The index is taken from the effective address bits [12:5] and the way 
is selected by bits [14:13]. The LRU remains unchanged and no parity check is performed. Address errors may 
occur for invalid addresses, but no TLB exceptions are raised. 




6.3.9 Index Load Tag (D) 


The Index Load Tag operation loads the data cache TagHi and TagLo registers with the information stored in the 
cache tag array. The tag index and way are taken from address bits [12:5] and [14:13], respectively. See the 
TagHi and TagLo definitions below for the format and data transferred by these registers. The LRU remains 
unchanged and no parity check is performed. Address errors may occur for invalid addresses, but no TLB 
exceptions are raised. The LU bit is set to one when an Index Load Tag is performed. 


6.3.10 Index Store Tag (D) 


The Index Store Tag operation reads the data cache TagHi and TagLo registers and stores the information into the 
cache tag array. The tag index and way are taken from address bits [12:5] and [14:13], respectively. See the 
TagHi and TagLo definitions below for the format and data transferred by these registers. Address errors may 
occur for invalid addresses, but no TLB exceptions are raised. 


If the LU bit is set in the TagHi register, then the LRU is set to the state indicated by the LRU field; otherwise, it
is set to a default state. Invalid LRU values will also be reset to a default state. See the LRU implementation
notes below.


A parity calculation is not performed by this operation, so the parity bits for the tag are taken directly from the P1
and P0 bits in the TagLo register. In addition, the state check bits are sourced directly from the TagHi register.


6.3.11 Hit Invalidate (D) 


The Hit Invalidate operation clears the state of a data cache line if the translated physical address matches a tag in 
the cache. A hit sets the state, coherent, and check bits to zero. The LRU remains unchanged. Address errors 
may occur for invalid addresses, and TLB exceptions may be raised as a result of the address translation. 


Parity errors detected by this operation leave the state of the data cache unchanged. In addition, any cache error 
exceptions will be taken imprecisely. 


6.3.12 Hit Writeback Invalidate (D) 


The Hit Writeback Invalidate operation causes a cache line to be written back to memory if the translated 
physical address matches a tag in the cache and the data are dirty. Additionally, the state of the line is cleared, 
and the state, coherent, and check bits are set to zero. The LRU remains unchanged. Address errors may occur 
for invalid addresses, and TLB exceptions may be raised as a result of the address translation. 


Tag parity errors detected by this operation leave the state of the data cache unchanged, and no writebacks to the 
bus occur. In addition, any cache error exceptions will be taken imprecisely. 


A single-bit ECC error detected by this operation is corrected, and the data are written into memory with a
corrected data code. A double-bit ECC error detected by this operation is not corrected, and the data are written 
to memory with an uncorrected data code. In either case, any cache error exceptions will be taken imprecisely. 


6.3.13 Hit Writeback (D) 


The Hit Writeback operation causes a cache line to be written back to memory if the translated physical address 
matches a tag in the cache and the data are dirty. Additionally, the state of the line is modified to clean, and
the data are retained in the cache. For coherent lines, the state becomes 0b10 and the check bits change to 0b11.
For non-coherent lines, the state becomes 0b10 and the check bits change to 0b10. The LRU remains unchanged.
Address errors may occur for invalid addresses, and TLB exceptions may be raised as a result of the address 
translation. 


Tag parity errors detected by this operation leave the state of the data cache unchanged, and no writebacks to the 
bus occur. In addition, any cache error exceptions will be taken imprecisely. 


A single-bit ECC error detected by this operation is corrected, and the data are written into memory with a 
corrected data code. A double-bit ECC error detected by this operation is not corrected, and the data are written 
to memory with an uncorrected data code. In either case, any cache error exceptions will be taken imprecisely. 


6.3.14 Index Load Data (D) 


The Index Load Data operation is implemented purely for debug purposes and loads doubleword data and ECC 
from the data array into the DataHi and DataLo registers. The data are loaded into the DataLo register, while the 
ECC bits for the data are written into the DataHi register. The index and way for the operation come from bits 
[12:5] and [14:13], respectively. 


The LRU remains unchanged for Index Load Data CACHE instructions, and an ECC check is not performed. 
Address errors may occur for invalid addresses, but no TLB exceptions are raised. 


6.3.15 Index Store Data (D) 


The Index Store Data operation is implemented purely for debug purposes and stores the doubleword data and 
ECC contained in the DataHi and DataLo registers into the data cache data array. The DataLo register contains 
the doubleword data, while the DataHi register includes the ECC for the data. The index and way for the 
operation come from bits [12:5] and [14:13], respectively. 


The LRU remains unchanged for Index Store Data CACHE instructions. An ECC calculation is not performed 
by this operation, and the ECC bits for the data are taken directly from the DataHi register. Address errors may 
occur for invalid addresses, but no TLB exceptions are raised. 


6.4 Cache Operation Effects on Duplicate Tags


This section is TBD. 


6.5 CACHE Instruction Issue Rules 


CACHE instructions are serially issued, i.e. all previous instructions must graduate so that potential mispredicts 
and exceptions are cleared before the operation executes. In addition, the CACHE operation performs an 
implicit memory synchronization since outstanding loads and stores (and even other CACHE instructions) may 
update the cache state. An implicit memory synchronization follows the CACHE operation as well so 
subsequent loads and stores can observe the effect of the CACHE instruction. 


Note that the synchronization does not apply to instruction accesses, so the result of a CACHE operation on the
instruction cache is unpredictable if the effective address generated by the CACHE operation is near a potential
cached instruction fetch path.


6.6 Register Definitions 


The following sections cover the tag and data registers supported by the SB-1 core.


6.6.1 Tag Registers (MIPS Compliant)


TABLE 6-9 TagLo Register: Register 28, Select 0 (Instruction Cache)

Bits | Size | Field | Description
 | | LRU | Least Recently Used Pointer
 | | | Read as zeros; ignored on write

TABLE 6-11 TagLo Register: Register 28, Select 2 (Data Cache)

Bits | Size | Field | Description
[63:40] | 24b | 0 | Read as zeros; ignored on write
[39:26] | 14b | | Physical Address bits [39:26]
[25:13] | 13b | | Physical Address bits [25:13]
[12] | 1b | 0 | Read as zeros; ignored on write
[11] | 1b | P1 | Parity bit; even parity
[10] | 1b | P0 | Parity bit; even parity
[9:0] | 10b | 0 | Read as zeros; ignored on write
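
As a small illustration of the data cache TagLo layout, the following C sketch extracts the physical address tag, which occupies bits [39:13] according to the table. The function names are hypothetical, and the example deliberately ignores the parity fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask for physical address bits [39:13] as held in the data cache TagLo. */
#define TAGLO_D_PA_MASK 0x000000FFFFFFE000ULL

/* Return the physical address tag with its bits left in place (bits [39:13]). */
uint64_t taglo_d_paddr(uint64_t taglo)
{
    return taglo & TAGLO_D_PA_MASK;
}

/* True if the tag read into TagLo covers the given physical address. */
bool taglo_d_matches(uint64_t taglo, uint64_t paddr)
{
    return taglo_d_paddr(taglo) == (paddr & TAGLO_D_PA_MASK);
}
```

A debug routine could use `taglo_d_matches` after an Index Load Tag to check whether a given way holds a particular physical line.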


TABLE 6-12 TagHi Register: Register 29, Select 2 (Data Cache)

Bits | Size | Field | Description
[31:30] | | |
 | | C |
 | | ExtNC | Not cached in external caches (e.g. L2)
 | | Stream | Stream bit
[3:0] | 4b | 0 | Read as zeros; ignored on write

State combinations (a):
Coherent-Exclusive-Clean
Coherent-Exclusive-Dirty

a. All other combinations are error combinations.

LRU Implementation Notes: 
The LRU pointer contains four 2-bit entries corresponding to the MRU to LRU ways, i.e., the two most-significant
bits indicate the MRU way while the two least-significant bits indicate the LRU way (the LRU acts as a
FIFO). For the LRU to be valid, each entry must contain a unique 2-bit way number, which results in a total of
24 valid combinations. The LRU pointer is corrected by the cache when one of the invalid combinations is
detected; the default value for the error case forces the entries, from MRU to LRU, to the following: way 3, way
2, way 1, way 0. The default value is also written during an Index Store Tag when the LU bit in the TagHi
register is clear. In addition, the LU bit is read as a one when an Index Load Tag operation is performed.
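
The validity rule for the LRU pointer can be checked with a few lines of C. The sketch below is a software model under the assumptions stated in the notes above (four 2-bit entries, MRU in the two most-significant bits); the names are hypothetical and the code does not describe the hardware implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default from the notes: way 3, way 2, way 1, way 0 from MRU to LRU = 0b11100100. */
#define LRU_DEFAULT 0xE4u

/* Valid when the four 2-bit entries name four distinct ways. */
bool lru_is_valid(uint8_t lru)
{
    unsigned seen = 0;
    for (int i = 0; i < 4; i++)
        seen |= 1u << ((lru >> (2 * i)) & 3u);
    return seen == 0xFu;
}

/* Model of the correction step: invalid patterns fall back to the default. */
uint8_t lru_correct(uint8_t lru)
{
    return lru_is_valid(lru) ? lru : (uint8_t)LRU_DEFAULT;
}

/* Count the valid encodings among all 256 possible 8-bit patterns. */
unsigned lru_count_valid(void)
{
    unsigned n = 0;
    for (unsigned v = 0; v < 256; v++)
        n += lru_is_valid((uint8_t)v) ? 1u : 0u;
    return n;
}
```

The count returned by `lru_count_valid` equals 4! = 24, in agreement with the number of valid combinations given in the notes.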


6.6.2 Data Registers (SiByte Debug Defined) 
The following tables define the Data Register portion of CACHE operations supported by SB-1. 


TABLE 6-14 DataLo Register: Register 28, Select 1 (Instruction Cache) 


Bits | Size | Field | Description


TABLE 6-15 DataHi Register: Register 29, Select 1 (Instruction Cache)

Bits | Size | Field | Description
[63:17] | 47b | 0 | Read as zeros; ignored on write
[16] | 1b | PDP | Predecode Parity Bit; even parity for PDA and PDB


Bits | Size | Field | Description
[63:0] | 64b | Data |


TABLE 6-17 DataHi Register: Register 29, Select 3 (Data Cache)

Bits | Size | Field | Description
[63:8] | 56b | 0 | Read as zeros; ignored on write
[7:0] | 8b | ECC | Cache Data ECC


6.6.3 Cache Coherency Attributes


Table 6-18 shows the Cache Coherency Attributes supported in SB-1. The “C Field” shown below is part of the
EntryLo0 and EntryLo1 registers in CP0 (Registers 2 and 3, Select 0). Refer to the MIPS64 Specification for
additional detail regarding CP0 registers.


TABLE 6-18 SB-1 Cache Coherency Attributes

C(5:3) = 0
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000®, VR5400, R10000®)
- Unused, defaults to cached (R4300™)
- Cacheable, noncoherent, write through, no write allocate (RC32364, RM5200)
SB-1 Assignment: Cacheable Coherent: Exclusive in L1, Uncacheable in L2. Similar to C = 4, but do not allocate in L2.

C(5:3) = 1
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000)
- Unused, defaults to cached (R4300)
- Cacheable, noncoherent, write through, write allocate (RC32364, RM5200)
- Cacheable write-through, write allocate (VR5400)
SB-1 Assignment: Cacheable Coherent: Shared in L1, Uncacheable in L2. Similar to C = 5, but do not allocate in L2.

C(5:3) = 2
Cache Coherency Attribute: Uncached
Historical usage:
- Uncached (all processors)
SB-1 Assignment: Uncached

C(5:3) = 3
Cache Coherency Attribute: Cacheable
Historical usage:
- Cacheable noncoherent (noncoherent) (R4000, R10000)
- Cached (R4300)
- Cacheable, noncoherent (writeback) (RC32364, RM5200)
- Cacheable, writeback (VR5400)
SB-1 Assignment: Cacheable Noncoherent

C(5:3) = 4
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent exclusive (exclusive) (R4000, R10000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, VR5400)
SB-1 Assignment: Cacheable Coherent Exclusive. Line is always fetched exclusive.

C(5:3) = 5
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent exclusive on write (sharable) (R4000, R10000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, VR5400)
SB-1 Assignment: Cacheable Coherent Sharable. Line is fetched shared on a load miss, exclusive on a store miss. Line is upgraded to exclusive if it is fetched shared but no other processor has it.

C(5:3) = 6
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent update on write (update) (R4000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, R10000)
SB-1 Assignment: Not Used

C(5:3) = 7
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200)
- Uncached accelerated (VR5400, R10000)
SB-1 Assignment: Uncached Accelerated: Merge in Write Buffer
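
The SB-1 assignments of the C field can be written down as a C enumeration, together with helpers that place the field in bits [5:3] of an EntryLo value. This is a sketch under the assumption that the attribute values run from 0 to 7 in the order listed in Table 6-18; the identifier names are hypothetical, and the helpers operate on plain integers rather than on the CP0 registers themselves.

```c
#include <assert.h>
#include <stdint.h>

/* SB-1 cache coherency attributes (C field); names are illustrative. */
enum sb1_cca {
    SB1_CCA_COH_EXCL_NO_L2     = 0, /* coherent, exclusive in L1, uncacheable in L2 */
    SB1_CCA_COH_SHARED_NO_L2   = 1, /* coherent, shared in L1, uncacheable in L2 */
    SB1_CCA_UNCACHED           = 2,
    SB1_CCA_CACHEABLE_NONCOH   = 3,
    SB1_CCA_COH_EXCLUSIVE      = 4, /* line is always fetched exclusive */
    SB1_CCA_COH_SHARABLE       = 5, /* shared on load miss, exclusive on store miss */
    /* value 6 is not used in SB-1 */
    SB1_CCA_UNCACHED_ACCEL     = 7  /* merge in write buffer */
};

/* Replace the C field, bits [5:3], of an EntryLo0/EntryLo1 value. */
uint64_t entrylo_set_cca(uint64_t entrylo, unsigned cca)
{
    assert(cca < 8);
    return (entrylo & ~(7ULL << 3)) | ((uint64_t)cca << 3);
}

/* Read the C field back out of an EntryLo value. */
unsigned entrylo_get_cca(uint64_t entrylo)
{
    return (unsigned)((entrylo >> 3) & 7u);
}
```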


CHAPTER 7 Virtual Memory Address Space and the TLB Format


7.1 Introduction 


Specification. 


7.2 Supported Memory Address Space in SB-1


Table 7-1 shows the MIPS64 virtual memory address space, as supported in SB-1. The supported physical
address space in SB-1 is 40 bits wide and the virtual address space is 44 bits wide [1].


TABLE 7-1 Virtual Memory Address Space

Segment Name(s) | Maximum Address Range | Associated with Address Mode | Reference Legal from Mode(s) | Actual Segment Size | Segment Type
kseg3 | 0xFFFF FFFF FFFF FFFF through 0xFFFF FFFF E000 0000 | Always | Kernel | 2^29 bytes | 32-bit Compatibility
sseg | 0xFFFF FFFF DFFF FFFF through 0xFFFF FFFF C000 0000 | Always | Supervisor, Kernel | 2^29 bytes | 32-bit Compatibility
kseg1 | 0xFFFF FFFF BFFF FFFF through 0xFFFF FFFF A000 0000 | Always | Kernel | 2^29 bytes | 32-bit Compatibility
kseg0 | 0xFFFF FFFF 9FFF FFFF through 0xFFFF FFFF 8000 0000 | Always | Kernel | 2^29 bytes | 32-bit Compatibility
Address Error | 0xFFFF FFFF 7FFF FFFF through 0xC000 0FFF 8000 0000 | | | |
xkseg | 0xC000 0FFF 7FFF FFFF through 0xC000 0000 0000 0000 | KX | Kernel | (2^44 - 2^31) bytes |
xkphys | 0xBFFF FFFF FFFF FFFF through 0x8000 0000 0000 0000 | KX | Kernel | 8 x 2^40 byte regions within the 2^62 byte segment |
Address Error | 0x7FFF FFFF FFFF FFFF through 0x4000 1000 0000 0000 | | | |
xsseg, xksseg | 0x4000 0FFF FFFF FFFF through 0x4000 0000 0000 0000 | SX | Supervisor, Kernel | 2^44 bytes |
Address Error | 0x3FFF FFFF FFFF FFFF through 0x0000 1000 0000 0000 | | | |
xuseg, xsuseg, xkuseg | 0x0000 0FFF FFFF FFFF through 0x0000 0000 8000 0000 | UX | User, Supervisor, Kernel | (2^44 - 2^31) bytes |
useg, suseg, kuseg | 0x0000 0000 7FFF FFFF through 0x0000 0000 0000 0000 | Always | User, Supervisor, Kernel | 2^31 bytes | 32-bit Compatibility

1. PABITS and SEGBITS, respectively, as referenced in the MIPS64 Specification
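
The address ranges of Table 7-1 lend themselves to a simple range classifier. The following C sketch maps a 64-bit virtual address to the region that contains it, using a 44-bit virtual address space as stated above. It is a simplified model: the enumerator and function names are hypothetical, the segment names for the compatibility regions follow common MIPS usage, and the code ignores the current operating mode and the KX, SX, and UX address mode bits, which determine whether a reference is actually legal.

```c
#include <stdint.h>

enum va_region {
    VA_USEG, VA_XUSEG, VA_XSSEG, VA_XKPHYS, VA_XKSEG,
    VA_KSEG0, VA_KSEG1, VA_SSEG, VA_KSEG3, VA_ADDRESS_ERROR
};

/* Classify a virtual address by range only (mode checks are omitted). */
enum va_region sb1_va_region(uint64_t va)
{
    if (va <= 0x000000007FFFFFFFULL) return VA_USEG;
    if (va <= 0x00000FFFFFFFFFFFULL) return VA_XUSEG;
    if (va <  0x4000000000000000ULL) return VA_ADDRESS_ERROR;
    if (va <= 0x40000FFFFFFFFFFFULL) return VA_XSSEG;
    if (va <  0x8000000000000000ULL) return VA_ADDRESS_ERROR;
    if (va <= 0xBFFFFFFFFFFFFFFFULL) return VA_XKPHYS;
    if (va <= 0xC0000FFF7FFFFFFFULL) return VA_XKSEG;
    if (va <  0xFFFFFFFF80000000ULL) return VA_ADDRESS_ERROR;
    if (va <= 0xFFFFFFFF9FFFFFFFULL) return VA_KSEG0;
    if (va <= 0xFFFFFFFFBFFFFFFFULL) return VA_KSEG1;
    if (va <= 0xFFFFFFFFDFFFFFFFULL) return VA_SSEG;
    return VA_KSEG3;
}
```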


Figure 7-1 shows the virtual address space supported by SB-1.


FIGURE 7-1 SB-1 Virtual Address Space


32-Bit Compatibility Address Space (detail of Figure 7-1)

Segment | Access | Address Range
kseg3 | Kernel Mapped | 0xFFFF FFFF E000 0000 through 0xFFFF FFFF FFFF FFFF
sseg | Supervisor Mapped | starts at 0xFFFF FFFF C000 0000
kseg1 | Kernel Unmapped Uncached | starts at 0xFFFF FFFF A000 0000
kseg0 | Kernel Unmapped | starts at 0xFFFF FFFF 8000 0000

For reference purposes, Table 6-18 from Chapter 6 is repeated below to clarify the cache coherency attribute 
encoding (CCA field <61:59> of virtual address) used in constructing the full virtual address. 


TABLE 7-2 SB-1 Cache Coherency Attributes

C(5:3) = 0
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000®, VR5400, R10000®)
- Unused, defaults to cached (R4300™)
- Cacheable, noncoherent, write through, no write allocate (RC32364, RM5200)
SB-1 Assignment: Cacheable Coherent: Exclusive in L1, Uncacheable in L2. Similar to C = 4, but do not allocate in L2.

C(5:3) = 1
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000)
- Unused, defaults to cached (R4300)
- Cacheable, noncoherent, write through, write allocate (RC32364, RM5200)
- Cacheable write-through, write allocate (VR5400)
SB-1 Assignment: Cacheable Coherent: Shared in L1, Uncacheable in L2. Similar to C = 5, but do not allocate in L2.

C(5:3) = 2
Cache Coherency Attribute: Uncached
Historical usage:
- Uncached (all processors)
SB-1 Assignment: Uncached

C(5:3) = 3
Cache Coherency Attribute: Cacheable
Historical usage:
- Cacheable noncoherent (noncoherent) (R4000, R10000)
- Cached (R4300)
- Cacheable, noncoherent (writeback) (RC32364, RM5200)
- Cacheable, writeback (VR5400)
SB-1 Assignment: Cacheable Noncoherent

C(5:3) = 4
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent exclusive (exclusive) (R4000, R10000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, VR5400)
SB-1 Assignment: Cacheable Coherent Exclusive. Line is always fetched exclusive.

C(5:3) = 5
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent exclusive on write (sharable) (R4000, R10000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, VR5400)
SB-1 Assignment: Cacheable Coherent Sharable. Line is fetched shared on a load miss, exclusive on a store miss. Line is upgraded to exclusive if it is fetched shared but there is no sharing.

C(5:3) = 6
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Cacheable coherent update on write (update) (R4000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200, R10000)
SB-1 Assignment: Not Used

C(5:3) = 7
Cache Coherency Attribute: Available for implementation dependent use
Historical usage:
- Reserved (R4000)
- Unused, defaults to cached (R4300)
- Reserved (RC32364, RM5200)
- Uncached accelerated (VR5400, R10000)
SB-1 Assignment: Uncached Accelerated: Merge in Write Buffer
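
Since the cache coherency attribute occupies bits <61:59> of an unmapped xkphys virtual address, such an address can be composed from an attribute value and a physical address. The C sketch below illustrates this under stated assumptions: the xkphys segment starts at 0x8000 0000 0000 0000, physical addresses are 40 bits wide as given in Section 7.2, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build an xkphys virtual address from a CCA value and a physical address. */
uint64_t xkphys_addr(unsigned cca, uint64_t paddr)
{
    assert(cca < 8);
    assert(paddr < (1ULL << 40)); /* 40-bit physical address space */
    return 0x8000000000000000ULL | ((uint64_t)cca << 59) | paddr;
}

/* Recover the CCA field, bits <61:59>, from an xkphys virtual address. */
unsigned xkphys_cca(uint64_t vaddr)
{
    return (unsigned)((vaddr >> 59) & 7u);
}
```

For example, physical address 0x1000 accessed with the Uncached attribute (C = 2) corresponds to virtual address 0x9000 0000 0000 1000.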


7.3 The TLB 


Table 7-3 shows the organization of the Translation Lookaside Buffer in SB-1. 


TABLE 7-3 TLB Organization in SB-1


7.3.1 TLB Entry Format
Figure 7-2 shows the TLB Format supported in SB-1.


FIGURE 7-2 TLB Entry Format in SB-1
(Bit positions legible in the figure: 217, 216, 205, 204, 191, 190, 189, 172, 171.)


CHAPTER 8 The CP0 Architecture


8.1 Introduction


This chapter provides the list of CP0 registers supported in SB-1. The MIPS64 Specification provides two sets of
CP0 registers, Required and Optional. For details on Required registers, refer to the MIPS64 Specification. This
chapter elaborates mainly on the list of Optional registers and optional fields within Required registers as
supported in SB-1 core.


TABLE 8-1 List of CP0 Registers in SB-1

Register Number | Sel | Register Name | Function | Compliance Level | Reference: MIPS64 Specification
0 | | Index | Index into the TLB entry | | Section 4.9.1, pg. 105
1 | | Random | Randomly generated index into the TLB array | | Section 4.9.2, pg. 106
2 | | EntryLo0 | Low-order portion of the TLB entry for even-numbered virtual pages | Required | Section 4.9.3, pg. 107
3 | | EntryLo1 | Low-order portion of the TLB entry for odd-numbered virtual pages | Required | Section 4.9.3, pg. 107
4 | | Context | Pointer to page table entry in memory | | Section 4.9.4, pg. 110
5 | | PageMask | Control for variable page size in TLB entries | | Section 4.9.5, pg. 111
6 | | Wired | Controls the number of fixed (“wired”) TLB entries | | Section 4.9.6, pg. 112
8 | | BadVAddr | Reports the address for the most recent address-related exception | Required | Section 4.9.7, pg. 113

13 | | Cause | Cause of last general exception | | Section 4.9.12, pg. 123
18 | | WatchLo | Instruction Watchpoint address | Implemented in SB-1 | Section 4.9.18, pg. 132. Also, refer to the Debug Architecture Chapter in this manual.
18 | | WatchLo | Data Watchpoint address | Implemented in SB-1 | Section 4.9.18, pg. 132. Also, refer to the Debug Architecture Chapter in this manual.
19 | | WatchHi | Instruction Watchpoint control | Implemented in SB-1 | Section 4.9.19, pg. 134. Also, refer to the Debug Architecture Chapter in this manual.
19 | | WatchHi | Data Watchpoint control | Implemented in SB-1 | Section 4.9.19, pg. 134. Also, refer to the Debug Architecture Chapter in this manual.
20 | | XContext | Extended Addressing Page Table Context | | Section 4.9.20, pg. 135
 | | | Reserved for future extensions | |
 | | | Performance Event register | Implemented in SB-1 | Section 4.9.21, pg. 136
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TABLE 8-1 List of CPO Registers in SB-1 


Register Number | Sel | Register Name | Function | Compliance Level | Reference: MIPS64 Specification
23 | 0 | Debug | EJTAG Debug register | Implemented in SB-1 | EJTAG v2.5 Specification. Also, refer to the Debug Architecture Chapter in this manual.
23 | 3 | EDebug | Extended Debug register | Implemented in SB-1 | EJTAG v2.5 Specification. Also, refer to the Debug Architecture Chapter in this manual.
24 | 0 | DEPC | Program counter at last EJTAG debug exception | Implemented in SB-1 | EJTAG v2.5 Specification. Also, refer to the Debug Architecture Chapter in this manual.
25 | 0 | PerfCnt | Performance counter interface | Implemented in SB-1 | Section 4.9.24, pg. 137. Also, refer to the Performance Monitoring Architecture Chapter in this manual.
26 | 0 | ErrCtl | Parity/ECC error control and status | Implemented in SB-1 | Section 4.9.25, pg. 140. Also, refer to the Error Handling Chapter in this manual.
 | | | Data Bus Error Physical Address | Implemented in SB-1 | Section 4.9.25, pg. 140. Also, refer to the Error Handling Chapter in this manual.
 | | | Instruction Cache error control and status | Implemented in SB-1 | Section 4.9.26, pg. 140. Also, refer to the Error Handling Chapter in this manual.
 | | | Data Cache error control and status | Implemented in SB-1 | Section 4.9.26, pg. 140. Also, refer to the Error Handling Chapter in this manual.
 | | | Data Cache Error Physical Address | Implemented in SB-1 | Section 4.9.26, pg. 140. Also, refer to the Error Handling Chapter in this manual.
28 | 0 | TagLoI | Low-order portion of instruction cache tag interface | Required | Section 4.9.27, pg. 142
28 | 1 | DataLo | Low-order portion of cache data interface | Not Implemented in SB-1 | Section 4.9.28, pg. 143
28 | 2 | TagLoD | Low-order portion of data cache tag interface | | Section 4.9.27, pg. 142
Register Number | Sel | Register Name | Function | Compliance Level | Reference: MIPS64 Specification
29 | 0 | TagHiI | High-order portion of instruction cache tag interface | Required | Section 4.9.29, pg. 143
29 | 1 | DataHi | High-order portion of cache data interface | Not Implemented in SB-1 | Section 4.9.30, pg. 144
29 | 2 | TagHiD | High-order portion of data cache tag interface | | Section 4.9.29, pg. 143
30 | | ErrorEPC | Program counter at last error | | Section 4.9.31, pg. 144. Also, refer to the Error Handling Chapter in this manual.
31 | 0 | DESAVE | EJTAG debug exception save register | Implemented in SB-1 | EJTAG v2.5 Specification. Also, refer to the Debug Architecture Chapter in this manual.


8.2.1 Processor Status and Control (Status, CP0 Register 12, sel0)
Figure 8-1 shows the SB-1 Status and Control Register.


[Figure 8-1 field diagram not recoverable. Surviving bit boundaries: 31, 18, 17, 16, 15, 0. Surviving field labels: "MIPS64 Defined Bits" (twice).]


FIGURE 8-1 SB-1 Status and Control Register 


The SBX bit allows the execution of SiByte specific extensions to the standard MIPS64 instruction set 
architecture. Upon reset, this bit is set to 0, disabling the execution of extended instructions. If an extended 
instruction is executed with this bit set to 0, the processor will generate a Reserved Instruction exception. The 
list of extended instructions in SB-1 follows: 

MDMX: PAVG, PABSDIFF, PABSDIFFC. Refer to Chapter 3 for additional details.


Floating Point: DIV.PS, RECIP.PS, RSQRT.PS, SQRT.PS. Refer to Chapter 4 for additional details.


8.2.2 Processor Identification and Revision (PRId, CP0 Register 15, sel0)


Figure 8-2 shows the format of the PRId register in the MIPS64 architecture.


31:24 Company Options | 23:16 CompanyID | 15:8 ProcessorID | 7:0 Revision


FIGURE 8-2 PRId Register Format 


This is a 32 bit read-only register, factory preset, that contains information identifying the manufacturer, 
manufacturer options, processor identification and revision level of the processor. Table 8-2 presents the value of 
this register in SB-1. 


TABLE 8-2 PRId Register Fields in SB-1


Field: Revision
Description: Specifies the revision number of the processor. This field allows software to distinguish between
one revision and another of the same processor type.
Setting: 0x1 (Initial Value)

Field: ProcessorID
Description: Identifies the type of processor. This field allows software to distinguish between various
processor implementations within a single company, and is qualified by the CompanyID field, described below.
The combination of the CompanyID and ProcessorID fields creates a unique number assigned to each processor
implementation.
Setting: [not legible in the source]

Field: CompanyID
Description: Identifies the company that designed or manufactured the processor. Software can distinguish a
MIPS32 or MIPS64 processor from one implementing an earlier MIPS ISA by checking this field for zero. If it is
nonzero, the processor implements the MIPS32 or MIPS64 Architecture. Company IDs are assigned by MIPS
Technologies when a MIPS32 or MIPS64 license is acquired. The encodings in this field are:
    0x0: Not a MIPS32 or MIPS64 processor
    0x1: MIPS Technologies, Inc.
    0x4: SiByte, Inc.

Field: Company Options
Description: Available to the designer or manufacturer of the processor for company-dependent options. The
value in this field is not specified by the architecture.
Setting:
    PRId<24> = MP Bit
        PRId<24> = 0x0 for Uniprocessor
        PRId<24> = 0x1 for Multiprocessors
    PRId<27:25> = Processor Number
        0x0 through 0x7, up to 8 processors
        If PRId<24> == 0, then the only valid value for PRId<27:25> is 0x0
    PRId<31:28> = SiByte Reserved
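

The field boundaries above can be illustrated with a short decoding routine. This is only a sketch: the structure and function names (prid_fields, prid_decode, prid_cpu_number_valid) are hypothetical and not part of any SiByte software interface, and the Revision, ProcessorID, and CompanyID positions are assumed to follow the standard MIPS64 PRId layout (bits 7:0, 15:8, and 23:16).

```c
#include <stdint.h>

/* Hypothetical container for the PRId fields described in Table 8-2. */
typedef struct {
    unsigned revision;      /* PRId<7:0>   */
    unsigned processor_id;  /* PRId<15:8>  */
    unsigned company_id;    /* PRId<23:16> */
    unsigned mp;            /* PRId<24>: 0 = uniprocessor, 1 = multiprocessor */
    unsigned cpu_number;    /* PRId<27:25>: processor number, 0 through 7 */
} prid_fields;

/* Split a raw 32-bit PRId value into its fields. */
prid_fields prid_decode(uint32_t prid)
{
    prid_fields f;
    f.revision     = prid & 0xFFu;
    f.processor_id = (prid >> 8) & 0xFFu;
    f.company_id   = (prid >> 16) & 0xFFu;
    f.mp           = (prid >> 24) & 0x1u;
    f.cpu_number   = (prid >> 25) & 0x7u;
    return f;
}

/* A uniprocessor (MP bit clear) may only report processor number 0. */
int prid_cpu_number_valid(prid_fields f)
{
    return f.mp != 0 || f.cpu_number == 0;
}
```

On the processor itself, the raw value would be obtained with an MFC0 from CP0 register 15, select 0; that step is omitted here so the sketch stays portable.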


8.2.3 Configuration Register (Config, CP0 Register 16, sel0) 


Bits 16 through 30 of this register are reserved for implementations by the MIPS64 ISA. SB-1 uses bits [19:16] of
the Config register to implement the multiprocessor vector offset bits, called MPV. This allows each processor to
"shift" its exception block in 64KB steps, i.e., bits [19:16] of the exception vector are selected directly from
this register field. At Reset, these bits are set to 0. These bits are read/write.


Figure 8-3 shows the bits taken by this field in the Config Register. 


Bit 31: [label not legible] | Bits 30:20: SB-1 Reserved | Bits 19:16: MPV | Bits 15:0: MIPS64 Defined Bits


FIGURE 8-3 MPV Field in SB-1 Config Register


Exceptions that vector to the boot block, e.g. vectors at 0xBxxx_xxxx such as reset, NMI, and debug, are not
affected by this offset, and as such, multiple copies do not need to be present in the ROM.
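

The effect of the MPV field on an exception vector can be sketched as follows. The function name apply_mpv and the macro names are hypothetical, and the sketch assumes the vector is handled as its low 32 bits (for example 0x8000_0180 or 0xBFC0_0000); it is an illustration of the substitution described above, not a model of the hardware.

```c
#include <stdint.h>

#define CONFIG_MPV_SHIFT 16
#define CONFIG_MPV_MASK  (0xFu << CONFIG_MPV_SHIFT)   /* Config bits [19:16] */

/* Return the vector a processor would use, given its Config register value.
 * Bits [19:16] of the vector are taken directly from Config.MPV, so each
 * MPV increment moves the exception block by 64KB. Vectors in the boot
 * block (0xBxxx_xxxx) are returned unchanged. */
uint32_t apply_mpv(uint32_t vector, uint32_t config)
{
    if ((vector >> 28) == 0xBu)
        return vector;  /* reset, NMI, debug: not affected by MPV */
    return (vector & ~CONFIG_MPV_MASK) | (config & CONFIG_MPV_MASK);
}
```

For example, with MPV = 3 a vector of 0x8000_0180 becomes 0x8003_0180, while the reset vector 0xBFC0_0000 stays where it is.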


8.2.4 Load Linked Address (LLAddr, CP0 Register 17, sel0) 
The LLAddr register contains relevant bits of the physical address read by the most recent Load Linked
instruction.


This register is for diagnostic purposes only and serves no function during normal operation. The format of this
register in SB-1 is shown in Figure 8-4.


Bits 63:40: [label not legible] | Bits 39:0: Physical Address


FIGURE 8-4 LLAddr Register Format in SB-1 


Table 8-3 describes the LLAddr register fields. 


TABLE 8-3 LLAddr Register Field Descriptions 


Field | Bits | Description | Read/Write | Reset State
PAddr | 39:0 | This field encodes the physical address read by the most recent Load Linked instruction. | R/W | Undefined


8.2.5 Watchpoint Address (WatchLo, CP0 Register 18, sel0-n)


For details, refer to the chapter on Debug Architecture in this manual. 


8.2.6 Watchpoint Control (WatchHi, CP0 Register 19, sel0-n) 


For details, refer to the chapter on Debug Architecture in this manual. 


8.2.7 EJTAG Debug Register (Debug, CP0 Register 23, sel0)


For details, refer to the chapter on Debug Architecture in this manual. 


SB-1 Users Manual | 8-127 


Overview of CPO Registers | SiByte Confidential 


8.2.8 Program Counter at Last EJTAG Debug Exception (DEPC, CP0 Register 24, sel0)


For details, refer to the chapter on Debug Architecture in this manual. 


8.2.9 Performance Counter Interface (PerfCnt, CP0 Register 25, sel0) 


For details, refer to the chapter on Performance Monitoring Architecture in this manual. 


8.2.10 Parity/ECC Error Control and Status (ErrCtl, CP0 Register 26, sel0) 


For details, refer to the chapter on Error Handling in this manual. 


8.2.11 Cache Error Control and Status (CacheErr, CP0 Register 27, sel0-3) 


For details, refer to the chapter on Error Handling in this manual. 


8.2.12 Low-order Portion of Cache Data Interface (DataLo, CP0 Register 28, sel1) 
This CP0 register is not implemented in SB-1.


8.2.13 High-order Portion of Cache Data Interface (DataHi, CP0 Register 29, sel1) 
This CP0 register is not implemented in SB-1.


8.2.14 EJTAG Debug Exception Save Register (DESAVE, CP0 Register 31, sel0) 


For details, refer to the chapter on Debug Architecture in this manual.


8.3 Privileged Resource Hazards


This section details the hazards surrounding the SB-1 privileged resources. Specifically, the use of privileged
resources, such as CP0 registers, the TLB, and cache state, and the execution of privileged instructions, such as
MTC0 and MFC0, are covered.


8.3.1 Privileged Resources and Instructions 


The SB-1 privileged resources include the Coprocessor 0 registers, the Translation Lookaside Buffer, and the
Instruction and Data Cache Tags.


Some CP0 registers serve as the interface between software and the hardware resource. For TLB access, the
following CP0 registers are used: Index (0), Random (1), EntryLo0 (2), EntryLo1 (3), PageMask (5), and
EntryHi (10). For cache tag access, TagLo-I (28, sel. 0) and TagHi-I (29, sel. 0) interface with the instruction
cache, and TagLo-D (28, sel. 2) and TagHi-D (29, sel. 2) interface with the data cache.


The following table outlines the resources required by the privileged instructions defined by the MIPS64 ISA.
The Inst column indicates the type of instruction, while the Source and Destination columns list the required
resources and the resources updated by the instruction.


TABLE 8-4 Resources Required by MIPS64 Privileged Instructions 


Inst | Source | Destination
CACHE | TLB, Cache Tags, TagLo/TagHi | Cache Tags, TagLo/TagHi
ERET | Status, EPC, ErrorEPC | Status, PC
DERET | Debug, DEPC | Debug, PC
DMFC0/MFC0 | CP0 Register |
DMTC0/MTC0 | | CP0 Register
 | | TLB, EntryHi
TLBP | TLB, EntryHi | Index
TLBR | TLB, Index | TLB, PageMask, EntryHi/EntryLo
TLBWI | Index, PageMask, EntryHi/EntryLo | TLB
TLBWR | Random, PageMask, EntryHi/EntryLo | TLB

[The Inst cells for the CACHE, ERET, DERET, TLBWI, and TLBWR rows are not legible in the source and are inferred from the listed resources; empty cells are not legible.]


In addition, some processor activities implicitly require privileged resources to operate. These are listed below:


TABLE 8-5 Processor Activities Requiring Privileged Resources 


Activity | Resources Required
Inst Fetches | Cache Tags, TLB [remainder of cell not legible]
Loads | TLB, Cache Tags
Stores | TLB, Cache Tags


8.3.2 Privileged Resource Hazards 


Whenever an instruction writes a result to a resource required by a subsequent instruction or action, a hazard
exists. This is commonly known as a Read-After-Write (RAW) hazard. (Other types of hazards, such as WAW
and WAR, are not visible in the SB-1's implementation of the privileged resources.) In the SB-1, all privileged
instructions and memory actions such as loads and stores are interlocked by serializing the dependent operations.
As a result, no SSNOPs are required between these types of operations.


The privileged instructions and other operations can be classified into several groups, as shown below: 


TABLE 8-6 Operation Grouping of Privileged and Miscellaneous CPU Operations 


Group | Instructions | Description
EntryHi | MTC0 to CP0.EntryHi | Move to EntryHi
MemOp | Loads and Stores | Load and Store Operations

[The remaining rows of Table 8-6 are not legible in the source.]


Note that in this classification, a move to the CP0.EntryHi register is a special case due to its effects on the
implementation, namely the TLB.


The following matrix outlines the implemented interlocks for each pair of operation groups. To eliminate the
hazards, the SB-1 effectively serializes the instructions from each group that may result in a hazard. Note that no
hazards exist for a group followed by a CP0Wr or EntryHi (these are WA* hazards).


TABLE 8-7 Implemented Interlocks for Each Pair of Operation Groups* 


[Table 8-7 matrix not recoverable. The surviving cells read "HW Interlock".]


a. Hardware interlock implies serialization 


Instruction cache operations are not interlocked with respect to the instruction fetch. As a result, care must be 
taken when executing these operations since they may have unpredictable results. It is recommended that 
instruction cache operations only be executed from uncacheable space. 


8.3.3 CP0 Register Side-Effects


Although the above operations are interlocked, side-effects of writing CP0 registers are not interlocked. In
general, the CP0 write takes effect in stage 8 of the pipeline; however, the usage of the CP0 data may take place
earlier in the pipeline. Thus, there is a "shadow" of some number of instructions between the update and the
point at which the update is observed. These hazards are divided into fetch hazards and execution hazards.


(Note that updates of CP0 state that are invisible to software, such as exceptions, or that are implicit in the
execution of certain instructions, such as an ERET, are handled by the hardware, so no SSNOPs are required for
proper operation.)


8.3.3.1 Fetch Hazards 


Because the instruction fetch is decoupled from the execution of instructions, there is no way to guarantee the
timing between the CP0 register write and the instruction fetch. To eliminate this type of hazard, an ERET
instruction must be executed between the CP0 write and the first instruction that should observe the update. In
the SB-1, there are no cases that require placing SSNOPs before the ERET instruction.


Writes to CP0 registers that affect instruction fetch are listed below by register:


TABLE 8-8 CP0 Registers that Affect Instruction Fetch


Register | Field | Action
 | | Inst Cache Lookup/Inst Watch Exception
 | | Coprocessor Usable Exceptions
 | | Instruction Fetch Endianness
 | | MDMX Usable Exceptions
 | | Rsvd Inst for 64b User Instructions
 | | Address Error/TLB Refill Exceptions
 | | Address Error Exceptions
 | | Address Error/Inst Watch Exceptions
 | | Instruction Cache Lookup
 | All | Instruction Watch Exceptions
WatchHi-I | All | Instruction Watch Exceptions

[The Register and Field cells are not legible in the source except where shown.]


In addition, installing a new TLB entry for the instruction fetch requires executing an ERET to ensure that the 
TLB state has been updated properly. 


8.3.3.2 Execution Hazards 


Execution hazards occur between the time of the CP0 register write and the time of the CP0 register use after the
issue stage. The following table indicates when certain instructions or actions require and/or generate CP0 state.
Note that TLB operations and registers, as well as CACHE instructions, are omitted since they are fully
interlocked.


TABLE 8-9 Required/Generated CP0 States by SB-1 Instructions and Activities 


[The layout of Table 8-9 (Inst/Event, Operands, Results, and the pipestage columns) is not recoverable. The
surviving entries are listed below in source order.]

CP0 Register
Status.ERL
Status.EXL
Config.K0
WatchLo-D
WatchHi-D
Status.FR
Status.SBX
Status.BEV
Cause.IV
Config.MPV
Interlock
Interlock
Interlock
Interlock
FP 64b Registers
SiByte Ext
Exception
Status.EXL
Cause.BD
Cause.CE
Cause.ExcCode
ErrorEPC
Context
BadVAddr
EntryHi
Mem Exception (including above)
XContext
Status.IM
Status.ERL
Status.EXL
Status.IE
Cause.IP
Interrupt: See Exception Above
Timer Int: Count
Compare
Status.EXL
Status.ERL
Cause.WP
ErrorEPC
Def Watch Exc (Data): Cause.WP
See Exception Above
ERET: Status.ERL
Debug Exception: Debug.DBD
Debug.CacheEP
DERET: DEPC, Debug.DM
Debug.IEXI


The pipestages for each event can be used to calculate the required separation (in cycles) between two events.
Note that the SB-1 is a superscalar machine, so cycles do not necessarily equal instructions. To make the number
of instructions equal the number of cycles, the SSNOP instruction can be used since this instruction forces
single-issue for itself.

To calculate the separation, the following formula can be used:


Separation = Pipestage(Result) - Pipestage(Operand) - 1


For example, an MTC0 instruction that modifies the RE bit must occur a certain number of cycles before a
subsequent load or store. The number of cycles, or SSNOPs, is 8 (MTC0 write) - 0 (Load/Store use) - 1, or 7.
Likewise, an MTC0 that enables interrupts may cause an interrupt to be taken on an instruction issued four cycles
later:

Separation = 8 (MTC0 write) - 4 (IE use) - 1 = 3


T      MTC0 r1, CP0.Status
T+1    SSNOP
T+2    SSNOP
T+3    SSNOP
T+4    (interrupt seen here)
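

The separation formula can be captured in a small helper. This is a sketch only: the function name cp0_separation is hypothetical, and clamping a negative result to zero is an assumption made here for convenience (the manual only states that the separation may be 0 cycles in some cases).

```c
/* Number of cycles (or SSNOPs) needed between an instruction that produces
 * CP0 state in pipestage result_stage and an instruction or event that
 * consumes it in pipestage operand_stage:
 *     Separation = Pipestage(Result) - Pipestage(Operand) - 1
 * Negative values are clamped to 0 (an assumption of this sketch). */
int cp0_separation(int result_stage, int operand_stage)
{
    int cycles = result_stage - operand_stage - 1;
    return cycles > 0 ? cycles : 0;
}
```

With the pipestages quoted in the two examples above, cp0_separation(8, 0) gives 7 for the RE bit followed by a load or store, and cp0_separation(8, 4) gives 3 for the interrupt-enable case.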


In some cases, the separation may be 0 cycles, such as between an MTC0 to the EPC register followed by an
ERET. In the SB-1, privileged instructions are always the oldest instruction issued in a particular cycle. As a
result, privileged instructions are effectively pipelined, so SSNOPs do not need to be used to ensure that two
privileged instructions are issued in different cycles. With this behavior and with interlocks, privileged
instruction sequences need not include intervening instructions.


The cycle separation calculated by using the above table indicates the maximum shadow resulting from a
particular pair of operations. Use of the dependent operation within that shadow may result in
UNPREDICTABLE behavior.


CHAPTER 9 The Debug Architecture


9.1 Introduction 


This document covers the SB-1 core debug implementation. The debug features of the SB-1 are mostly software
in nature and allow customers programming or [...] the SB-1 core to debug their software and hardware.


9.2 Debug Features 


The SB-1 core provides customers several debug features that aid in the development of hardware and software
systems. Included in the [...] compliant EJTAG subset [...] which includes an [...] signaling interface and an
alternate debug vector which enables loading code over the TAP.


9.2.1 Watch Registers 


Two Watch register pairs (WatchHi and WatchLo) are implemented by the SB-1 core to trap on software-
specified addresses. The first pair, selected when sel equals zero, can be used to break on instruction addresses,
while the second pair, selected when sel equals one, traps load or store accesses.

The Watch registers are implemented as specified in the MIPS64 document with the following exceptions. The
WatchLo register corresponding to select zero always reads bits [1:0] as zero and ignores writes to those bits,
because that register is used exclusively for instruction references. Likewise, bit [2] of WatchLo register one
always reads as zero and is ignored on writes, and bits [1:0] are used as enables, since that register corresponds
to data references only.
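

As an illustration, the two WatchLo values could be composed as shown below. This is a sketch under stated assumptions: the helper names watchlo_inst and watchlo_data are hypothetical, and the enable positions (I in bit 2, R in bit 1, W in bit 0, with the address in bits 63:3) are taken from the standard MIPS64 WatchLo layout rather than from this chapter.

```c
#include <stdint.h>

#define WATCHLO_VADDR_MASK (~(uint64_t)0x7)  /* bits 63:3 hold the address */
#define WATCHLO_I (1u << 2)                  /* instruction fetch enable   */
#define WATCHLO_R (1u << 1)                  /* load enable                */
#define WATCHLO_W (1u << 0)                  /* store enable               */

/* Value for WatchLo, sel 0: only the I enable is meaningful; bits [1:0]
 * always read as zero on the SB-1. */
uint64_t watchlo_inst(uint64_t vaddr, int enable)
{
    return (vaddr & WATCHLO_VADDR_MASK) | (enable ? WATCHLO_I : 0u);
}

/* Value for WatchLo, sel 1: only the R and W enables are meaningful;
 * bit [2] always reads as zero on the SB-1. */
uint64_t watchlo_data(uint64_t vaddr, int on_load, int on_store)
{
    return (vaddr & WATCHLO_VADDR_MASK)
         | (on_load  ? WATCHLO_R : 0u)
         | (on_store ? WATCHLO_W : 0u);
}
```

The resulting values would be written with DMTC0 to CP0 register 18 using select 0 or select 1 respectively.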


TABLE 9-1 WatchLo/Hi Register Specifics 


[Table 9-1 contents are not recoverable. Surviving fragments: column headings Bits, Field, Description, Access,
Reset; bit [2], "Instruction watch enable"; a "WatchHi, Sel = 0" group; bit ranges [63:3], [15:12], and [2:0].]


9.3 EJTAG


The SB-1 core features a compliant subset of the MIPS EJTAG Version 2.5 functionality. EJTAG extends the
operating modes, the ISA, and the CP0 registers of a MIPS processor.


In addition to the normal kernel, supervisor, and user modes, the EJTAG specification defines a special debug 
mode. In debug mode, several types of debug exceptions may be serviced, including single step instruction 
breaks and debug interrupts signalled by the external agent via the DINT pin. The debug mode may also be 
entered via the Software Debug Breakpoint instruction (SDBBP). 


EJTAG defines several CP0 registers to hold debug state when a debug exception is encountered. The Debug
register contains information about how the debug handler was entered, while the DEPC register holds the PC of
the instruction that was executing when the debug exception occurred. To provide consistent state between
debug exceptions, the DESAVE register is implemented and acts as a scratch register for the debug handler.
When the handler is complete, a DERET instruction is executed, resuming the original program at the PC stored
in DEPC.


The EJTAG spec outlines the behavior of debug mode and single step instruction break (which is enabled for all
modes except debug when the SSt bit is set) as well as the register definitions for the extended CP0 registers.
Since the SB-1 does not implement some EJTAG features, the Debug register is defined as follows (for an
explanation of EDM, see the next section):


TABLE 9-2 Debug Register: CP0 Register 23, Sel = 0, EDM = 0


Bits | Field | Description | Access | Reset
[31] | DBD | Debug Branch Delay Slot | R | X
[30] | DM | Debug Mode Status | R |
[27] | Doze | Doze Status | |
[20] | IEXI | Imprecise Error Exception Inhibit | |
[19] | DDBSImpr | Debug Data Brk St Impr Stat (a) | |
[18] | DDBLImpr | Debug Data Brk Ld Impr Stat (a) | |
[17:15] | EJTAGver | EJTAG Version (Version 2.5) | |
[14:10] | DExcCode | Debug Exception Code | |
[9] | NoSSt | No Single Step Implemented | |
[8] | SSt | Single Step Enable | |
[5] | DINT | Debug Interrupt Status | |
[4] | DIB | Debug Instruction Break Exception Status (c) | |
[3] | DDBS | Debug Data Break Store Exception Status (c) | |
[2] | DDBL | Debug Data Break Load Exception Status | |
[1] | DBp | Debug Breakpoint Exception Status | |
[0] | DSS | Debug Single Step Exception Status | |

[The rows for bits 29, 28, and 26:21 and most Access and Reset entries are not legible in the source. Bit
positions and field names not legible in the source follow the EJTAG v2.5 Debug register layout.]

a. These Debug Register bits are forced to 0 since the 
EJTAG memory region and break registers are not 
implemented 

b. R/W1 indicates that software may read the state of 
the bit but can only modify it by writing a one. 

c. These bits are always read as zero in standard debug mode, but when extended debug mode is enabled,
they are used to indicate watch exception conditions. See below.


9.4 Extended Debug Mode


In addition to EJTAG, the SB-1 core implements a SiByte-defined extended debug mode to enhance the
processor's debug capabilities. Extended debug mode uses the existing EJTAG and Watch registers and defines
an extended debug mode register (EDebug) that controls the additional debug features. Addressed via the Debug
register number with select equal to 3, the EDebug register contains the following fields:


TABLE 9-3 EDebug Register(CP0 Register 23, Sel = 3) 


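Bits | Field | Description | Access | Reset
[not legible] | EDM | Extended Debug Mode enable | R/W | 0
[6] | SStPrv | SSt enable in privileged modes | |

[The remaining rows of Table 9-3 are not legible in the source.]

a. See discussion about the DBBOOT signal below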

Extended debug mode is controlled through the EDM bit in the EDebug register. When EDM is enabled, the
Watch register pairs are used to generate debug exceptions rather than normal Watch exceptions. As a result,
Watch register matches may no longer be deferred (the WP bit will never be set) and cause entry into the debug
handler if the core is not already in debug mode. Upon entering debug mode, bits [4:2] of the Debug register are
set depending on the type of Watch register match:


TABLE 9-4 Debug, Sel = 0, EDM = 1 


Bits | Field | Description | Access | Reset
[4] | EDIW | Extended Debug Instruction Watch Status | R | X
[3] | EDDWS | Extended Debug Data Watch St Status | R | X
[2] | EDDWL | Extended Debug Data Watch Ld Status | R | X


The EDebug register also allows software to select an alternate exception vector for debug exceptions. When the
AltVec bit is set in the EDebug register, the processor jumps to instruction address 0xB000_0480 instead of the
normal EJTAG debug exception vector. An external agent, like the SB-1250 SCD, may map the memory at the
physical address that results (0x00_1000_0480) to a JTAG probe so the debug handler and data can be delivered
by the probe. Servicing the debug exception through the probe allows flexible implementation of the handler.
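

The relationship between the alternate vector address and the physical address quoted above follows from the standard MIPS unmapped segment translation, which drops the upper address bits. A minimal sketch, assuming the address lies in the 32-bit compatibility kseg1 segment (the function name kseg1_to_phys is hypothetical):

```c
#include <stdint.h>

/* Translate an unmapped kseg1 virtual address (0xA000_0000 through
 * 0xBFFF_FFFF, optionally sign-extended to 64 bits) to its physical
 * address by keeping only the low 29 bits. */
uint64_t kseg1_to_phys(uint64_t vaddr)
{
    return vaddr & 0x1FFFFFFFu;
}
```

Applied to the alternate debug vector 0xB000_0480, this yields 0x1000_0480, which matches the physical address an external agent would map to the JTAG probe.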


Extended debug mode enables software to control single step more finely. The SStPrv bit enables single step in
non-user modes (kernel, supervisor, EXL, and ERL) and may be cleared when EDM is enabled so single step is
only active for user mode software. If EDM is disabled, or SSt is off, the state of this bit has no effect on single
step.


Finally, the EDE bit in the EDebug register allows software and certain hardware events to signal that a debug
event has occurred. The state of this bit is reflected in the EDEN signal, an SB-1 core output, which may be
driven to an external agent to trigger an outside action. The EDE bit is always settable from software; however,
if EDM is enabled, watchpoint events may set the bit as well, as long as the corresponding trigger bits are set
(bits [4:2] of the EDebug register). Clearing the EDE bit can only be done in software.


Debug mode can be entered directly from reset if the DBBOOT signal is asserted during reset. This signal forces 
the SB-1 core to begin fetching from the alternate debug vector in debug mode when reset is deasserted. The 
table below indicates what state is set by the DBBOOT signal during reset: 


TABLE 9-5 Debug Reset Behavior 


DBBOOT | DM | AltVec | Vector
0 | 0 | 0 | 0xBFC0_0000
1 | 1 | 1 | 0xB000_0480


DBBOOT can also be used to force the state of the AltVec bit in the EDebug register. Asserting DBBOOT during 
normal operation will set the AltVec bit but will not cause the processor to enter debug mode (DINT must be 
used if entry into debug mode is desired.) The bit can only be cleared by software, and the DBBOOT signal must 
be deasserted for the clearing write to take effect. 


Enabling some features in extended debug mode places certain restrictions on software. Because EDM uses the
Watch registers as breakpoints, setting EDM when the Cause[WP] bit is one results in UNDEFINED processor
behavior. In addition, the behavior of the processor is UNDEFINED if the EDM and SStKS bits are modified
while not in debug mode, since single step behavior cannot be guaranteed. Software should check or clear the
WP bit before setting the EDM bit, and handlers should be careful not to modify the state of the EDM and
SStKS bits outside of debug mode. The EDE and AltVec bits may, however, be modified as long as the previous
restrictions are honored.


9.5 Debug Signal Pins


The following table summarizes the external debug signals implemented on the SB-1 core: 


TABLE 9-6 Debug Signal Pins 


Signal | Direction | Description
DINT | I | Debug Interrupt: causes the processor to take a debug exception and enter debug mode
DBBOOT | I | Debug Boot: forces the processor into debug mode after reset and initiates the instruction fetch from the alternate debug vector. Sets AltVec immediately. On Reset, causes immediate entry to DM at Alternate Vector.
EDEN | O | Extended Debug Event Notification: asserts to notify an external agent of a core debug event; initiated by the debug handler or hardware watchpoints


DBBOOT can also be used to force the state of the AltVec bit in the EDebug register. Asserting DBBOOT 
during normal operation will set the AltVec bit but will not cause the processor to enter debug mode. (DINT 
must be used if entry into debug mode is desired.) The bit can only be cleared by software, and the DBBOOT 
signal must be deasserted for the clearing write to take effect.


CHAPTER 10 Error Handling


10.1 Introduction 


This chapter covers the error handling capabilities of the SB-1 core. Errors are classified into several groups
depending on their relationship to the instruction that caused or detected the error. Each of these classes is
defined here:


Imprecise - An exception is imprecise [...]


Deterministic - A deterministic error is imprecise, but the instruction that caused or detected the error can be
determined by interpreting the program flow from the instruction indicated by the EPC or ErrorEPC. In general,
instructions that cause or detect deterministic errors are located within four instructions of that PC.


Recoverable - A recoverable error is imprecise, but the instruction stream may be restarted from the instruction
indicated by the EPC or ErrorEPC. In addition, recoverable errors are corrected by hardware, which completes
the correction before the error is signaled.


Table 10-1 classifies the types of errors that are detectable in the SB-1 core and describes the type of exception
taken by each:


TABLE 10-1 SB-1 Error Types 


Error Type | Exception Type
[...] Bit ECC Error | Cache Error
Duplicate Tag Address Parity Error |
Time Out Counter Expiration | Machine Check

[Most rows of Table 10-1 are not recoverable. Other surviving fragments: the error-type labels "Data Cache",
"Bit ECC Error", and "General", and the exception types "Cache Error" (several rows) and "Machine Check".]


All other errors are undetectable and may manifest themselves in unpredictable processor behavior. The
subsequent sections detail the various error detecting and correcting properties of the different units on the CPU.


10.2 Instruction Cache


In general, the instruction cache is parity protected in both the tag and data arrays. The tag fields and their 
protection are listed in Table 10-2. 


TABLE 10-2 Instruction Cache Tag Field Protection 


Field | TagLo Bits | Size | Protection | TagLo Bit
[Rows not recoverable. Surviving fragments indicate a 12-bit field protected by parity bit P0.]


Even parity is implemented in the instruction cache tag; that is, the number of ones in the protected data and the
parity bit is an even number. The valid (V) bit is covered by a single parity bit (P). In the rest of the tag, P0
covers the ASID, the G bit, and bits [24:13] of the address. P1 covers bits [43:25] of the address and the region
bits.


The instructions and the predecode bits are protected by a parity bit for every byte of data. As a result, one 
cacheline contains 32b of parity for the instructions and 4b of parity to cover the predecode bits (one bit of 
predecode parity covers two instructions’ predecode bits). Again, the parity for the data array is even parity, 
although this parity calculation is not visible to the user. 


Note that with a parity protection scheme, single bit errors can be detected, but not corrected, by the hardware.
Also, some double bit errors cannot be detected.
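The detection limits of even parity can be illustrated with a small sketch. The byte value and the function names below are hypothetical and chosen only for illustration; the sketch models the parity rule described above, not the hardware's internal calculation.

```python
def even_parity_bit(value):
    """Return the parity bit that makes the total number of ones even."""
    return bin(value).count("1") & 1


def parity_ok(value, parity):
    """True when the ones in the data plus the parity bit sum to an even number."""
    return (bin(value).count("1") + parity) % 2 == 0


data = 0b10110010                # hypothetical protected byte (four ones)
p = even_parity_bit(data)        # parity bit stored alongside the byte

single = data ^ 0b00000100       # one bit flipped: the check fails
double = data ^ 0b00010100       # two bits flipped: the check still passes

print(parity_ok(data, p), parity_ok(single, p), parity_ok(double, p))
```

A single flipped bit changes the count of ones from even to odd and is caught. Two flipped bits under the same parity bit restore an even count, which is why such double bit errors escape detection.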


10.2.1 Implementation Notes: 


The valid bit parity calculation factors into the hit signal. As a result, the V and P bits must match for the line to 
be valid in the cache. Tag address parity errors are calculated for each way and are signaled if any of the ways in 
the index contains an error, regardless of the hit/miss determination. Instruction and predecode parity errors are 
only reported if there is an error in the instructions being fetched. 


The LRU bits are not parity protected, but invalid combinations are flushed to a known valid format when an
invalid state is detected. Errors detected in the valid and parity bits are never reported, although they are
scrubbed to the invalid state (V==0, P==0).


Table 10-3 indicates the types of errors that can occur at different stages of an instruction access.


TABLE 10-3 Instruction Access Error Types 


During an instruction cache miss, uncorrectable errors signaled on the return of the cacheline cause a cache or 
bus error exception to be taken. In either case, the data are not filled into the cache. 


10.3 Data Cache 


The Data Cache address in the tag array is protected by even parity, as shown in Table 10-4. 


TABLE 10-4 Data Cache Tag Protection 


Field | TagLo Bits | Size | Protection | TagLo Bit


The state bits in the tag array are protected by a sparse encoding. This encoding is detailed in TagLo/TagHi 
descriptions in Chapter 6. 


In the data array, error correcting code (ECC) is implemented to correct single bit data errors and detect double 
bit errors. Both types of errors signal exceptions, although the single bit errors are corrected by hardware before 
the error handler is invoked. Single bit errors are imprecise, but the program may be restarted from the 
instruction address stored in the ErrorEPC register. An exception is taken on single bit errors so that software 
may log the error if desired. 


Double bit ECC errors are imprecise, but for load hits, the instruction that caused the error can be determined by
interpreting the instruction stream from the instruction address stored in the ErrorEPC register. In general, the
load instruction that caused the error is within four instructions in the dynamic instruction stream.


10.3.1 Implementation Notes


Data cache errors may be detected by speculative loads, although the errors themselves will be reported on
non-speculative instructions.


The LRU bits are not parity protected, but invalid combinations are flushed to a known valid format when an 
invalid state is detected. In addition, the stream bit in the tag array is not protected, so errors in this bit are not 
detected. 


Table 10-5 indicates the types of errors that can occur at different stages of a load access: 


TABLE 10-5 Load Errors 


Table 10-6 indicates the types of errors that can occur at different stages of a store access, with the final column 
indicating whether the store data are written into the data cache. 


TABLE 10-6 Store Errors 


Tag State Parity-Cache Error 
Tag Address Parity-Cache Error 


Single Bit ECC-Cache Error 
Double Bit ECC-Cache Error 


During a data cache miss, uncorrectable errors signaled on the return of the cacheline cause either a cache or bus
error exception to be taken. In both cases, the fill proceeds as it normally would, i.e., the data and address are
written, but the tag state is marked with an error code. Note that this behavior may result in additional cache
errors if the cache is subsequently accessed.


A fill may result in the eviction of a cacheline. If that cacheline contains an error, the processor may inhibit the
writeback or complete the writeback by signaling an error during the data phase of the bus. Table 10-7 indicates
what happens in each case:


TABLE 10-7 Cacheline Errors due to Evicts


Error-Exception 


Finally, the data cache is checked for errors on coherency requests, and both invalidate and intervention requests
cause the processor to take a cache error exception if the processor detects an error in any of the tags at the target
index. Invalidates that detect tag errors do not change the tag state, and interventions always reply to the
requestor with an error indication, as shown in Table 10-8.


TABLE 10-8 Error-Exception Types and Interventions


10.4 TLB 


The processor TLB is protected from multiple entry matches by detecting such cases on TLB writes. When a 
TLB write is executed and a match occurs that would result in multiple matching entries in the TLB, the 
processor takes a machine check exception and sets the TS bit in the Status register. This exception is imprecise. 


When matching entries are detected, the processor does not write the TLB with the conflicting entry. Software 
may try to correct the situation by flushing the TLB, but before the error handler returns, it must clear the TS bit. 


10.4.1 Implementation Notes 


The TLB actually implements a hidden "valid" bit which is cleared by reset and only set on a TLB write.
Matches cannot occur for an entry whose valid bit is cleared. Because the valid bit prevents matches after reset
and because the processor prevents matching writes, there is no need to "shut down" the TLB when an error
condition is detected, as multiple matching entries will never be present. Software clears the TS bit simply as a
matter of architectural convention, since the bit only indicates status. Note that even though the TS bit may be
set, TLB lookups will continue as no harm will result, and TLB faults may occur if the handler uses mapped addresses.


10.5 BIU 


The BIU forwards errors detected on the system bus to the exception unit. If an instruction request ends in an
uncorrectable cache error or bus error, the fetch unit signals the appropriate error for that request. All instruction
bus and cache errors are precise. A data request that ends in an error causes the processor to take the cache error
or bus error exception at the earliest available instruction. Because the primary data cache is non-blocking, these
errors are imprecise. In addition, the imprecise errors are held pending until the detected exception is taken. On
all requests that end in an external error, the data returned have no meaning and are never written into the cache.


The BIU contains no timeout mechanism for bus requests. The external agent is responsible for returning a bus
error on a processor request that cannot be serviced so the BIU request entry may be deallocated.


The primary data cache duplicate tags also detect parity errors on all coherent bus transactions. Because the
duplicate tags are basically shadow copies of the main tags, they have the same error properties and may detect
errors in the state bits or address bits. Any way that contains an error causes a cache error to be taken. Each case
is outlined in Table 10-9.


TABLE 10-9 Duplicate Tag State/Address Parity Cache Errors


Tag State Parity-Cache Error | Imprecise | Error (Unowned)
Tag Address Parity-Cache Error | Error (Unowned)


Any error in the duplicate tags causes the processor to indicate an error response. From the point of view of the 
bus protocol, the processor does not own the line. 


10.6 General 


The CPU implements a 29b timeout counter to detect when the processor is no longer executing instructions. 
The counter is reset to zero every cycle an instruction graduates and increments whenever zero instructions 
graduate. If the counter overflows, the processor takes a machine check exception, which releases the processor 
to begin execution at the exception handler. Processor state may be saved by the handler, but it is likely that the 
core needs to be initialized by a reset sequence to behave correctly. 
Software can detect a machine check due to a timeout by reading the TO bit in the ErrCtl register. When the
timeout counter expires, this bit is set, and it can only be cleared by reset. 


10.6.1 Implementation Notes: 


A 29b counter will cause a machine check after approximately 500ms at 1GHz (2^29 cycles elapse before a carry
out of the counter is generated).
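The figure quoted above can be checked with a short calculation. This is only a worked arithmetic sketch; the constant and function names are hypothetical, and the 1 GHz clock is the frequency assumed in the text.

```python
COUNTER_BITS = 29
CLOCK_HZ = 1_000_000_000         # 1 GHz, the frequency quoted in the text


def timeout_seconds(bits=COUNTER_BITS, clock_hz=CLOCK_HZ):
    """Time for the counter to carry out when no instruction graduates."""
    return (1 << bits) / clock_hz


print(timeout_seconds())         # roughly 0.54 seconds
```

At a lower core frequency the same 2^29 cycles take proportionally longer, so the wall-clock timeout scales inversely with the clock.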


10.7 Error Reporting Registers 


In general, the first source of error information can be found in the CP0 ErrCtl register (number 26, select 0). For
cache errors, this register indicates which cache, instruction or data, detected the error and whether the error is
recoverable, i.e., corrected by the hardware. In addition, the register indicates when multiple bus errors have
occurred or what action has resulted in a machine check.


ErrCtl (Register 26, Select 0):


TABLE 10-10 CP0 ErrCtl Register Fields


Field | Bits  | Description                                        | Access
R     | 31    | Recoverable Cache Error                            | R
DC    | 30    | Data Cache Error                                   | R
IC    | 29    | Instruction Cache Error                            | R
0     | 28:24 | Reserved                                           | R
MB    | 23    | Multiple Bus Errors Detected                       |
0     | 22:16 | Reserved                                           |
TS    | 15    | TLB Shutdown Machine Check (Copy of Status TS Bit) | R
TO    | 14    | Timeout Machine Check                              | R
0     | 13:0  | Reserved                                           |


The ERL bit modifies the behavior of the DC and IC bits when cache errors are detected. If ERL=0 and a cache 
error exception is taken, these bits log the cache in which the error occurred, and only one bit is set. If ERL=1, 
these bits become "sticky" so cache errors can accumulate while the processor is executing the cache error 
handler. In this way, data cache errors cannot mask instruction cache errors, or vice versa. 


Note that if a Cache Error exception is taken and the R bit is set, then the handler may return immediately as the
error has been corrected by hardware. Software may, however, log the error if desired since the CacheErr
registers contain valid information about the error. In addition, if the IC bit is set, the R bit is masked so that
instruction cache errors are always serviced.


The TS and TO bits may both be set in the ErrCtl register. In this case, the timeout error takes higher precedence 
than the TLB shutdown. 
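The ErrCtl rules above can be summarized in a small decoding sketch. The positions of MB (23), TS (15), and TO (14) follow Table 10-10; the positions used for R, DC, and IC (31, 30, 29) are an assumption based on the field order in that table, and the function names are hypothetical. The sketch operates on a register value that software has already read; it does not show how the register is accessed.

```python
ERRCTL_R = 1 << 31       # assumed position: recoverable cache error
ERRCTL_DC = 1 << 30      # assumed position: data cache error
ERRCTL_IC = 1 << 29      # assumed position: instruction cache error
ERRCTL_MB = 1 << 23      # multiple bus errors detected
ERRCTL_TS = 1 << 15      # TLB shutdown machine check
ERRCTL_TO = 1 << 14      # timeout machine check


def machine_check_cause(errctl):
    """Pick the machine check cause; timeout takes precedence over TLB shutdown."""
    if errctl & ERRCTL_TO:
        return "timeout"
    if errctl & ERRCTL_TS:
        return "tlb shutdown"
    return None


def handler_may_return_immediately(errctl):
    """A cache error handler may return at once only for corrected D-cache errors."""
    if errctl & ERRCTL_IC:
        return False     # instruction cache errors are always serviced
    return bool(errctl & ERRCTL_R)
```

For example, a value with both TS and TO set is reported as a timeout, and a value with R and IC both set is still treated as needing service.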


The CacheErr-I register (number 27, select 0) indicates the cause and location of errors detected in the instruction 
cache. The TA bit indicates that the processor detected a parity error in the tag address array, while the D bit 
indicates that the parity error was detected in the fetched instructions. The E bit indicates that the error occurred 
on an external access. For tag errors, the Idx field is valid and indicates the cache index where the error was 
detected, but the Way field is unpredictable. For data errors, both the Idx and Way fields are valid and point 
where the error instructions are located in the cache. Neither the Idx nor the Way field is defined for external 
errors, so the EPC register should be used for address information. 


The following table specifies the CacheErr-I format and indicates when the Idx and Way fields are valid: 


TABLE 10-11 CacheErr-I Format 


Field | Description | Access
0     | Reserved    | R


TABLE 10-12 Validity of Idx and Way Fields in ICache Errors


Error       | Idx     | Way
Tag Address | Valid   | Invalid
Data        | Valid   | Valid
External    | Invalid | Invalid
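A handler that logs instruction cache errors can apply the validity rules described above before trusting the Idx and Way fields. The following is a minimal sketch under that reading; the dictionary keys and the function name are hypothetical, and the error kind is assumed to have been determined already from the TA, D, and E bits.

```python
# Which CacheErr-I location fields are meaningful for each kind of error.
ICACHE_FIELD_VALIDITY = {
    "tag_address": {"idx": True, "way": False},
    "data": {"idx": True, "way": True},
    "external": {"idx": False, "way": False},
}


def trusted_location(kind, idx, way):
    """Return only the location fields that are defined for this error kind."""
    valid = ICACHE_FIELD_VALIDITY[kind]
    return {
        "idx": idx if valid["idx"] else None,
        "way": way if valid["way"] else None,
    }


print(trusted_location("tag_address", 12, 3))
```

For external errors both fields come back as None, which reflects the text's advice to use the EPC register for address information instead.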


CacheErr-I (Register 27, Select 0): 
The CacheErr-D register (number 27, select 1) indicates the cause and location of errors detected in the data
cache. The TS bit indicates that the processor detected a parity error in the tag state array, while the TA bit 
signals a parity error in the tag address. If a single-bit ECC error is detected, then the DS bit is set, and the DD 
bit is set when a double-bit ECC error is detected. Finally, the E bit is set when a data cache access ends in an 
external error. 


Five bits in the CacheErr-D register indicate the type of access that caused the error to be detected: L for loads, 
S for stores, WB for writebacks, C for coherency requests (such as invalidates or interventions), and DT for 
duplicate tag accesses. 


In addition to the CacheErr-D register, another CP0 register, CacheErr-DPA (register 27, select 3), captures the
entire 40b physical address for data cache accesses. It is always written when an error is detected in the data
cache.


Table 10-13 specifies the CacheErr-D format and indicates when the Idx and Way fields are valid and when the
PA is valid in the CacheErr-DPA register.


TABLE 10-13 CacheErr-D Format


[28] Data Array Single-Bit ECC Err
[27] Data Array Double-Bit ECC Err
[26] External Cache Error
[25] Error on Load Access
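The rows of Table 10-13 that are listed above can be decoded as follows. This is a sketch only: bit 28 for the single-bit ECC flag is assumed from the descending order of the rows, the short names DS, DD, E, and L come from the register description earlier in this section, and the function name is hypothetical. The remaining CacheErr-D fields are not decoded here.

```python
# (bit position, field name) for the rows listed in Table 10-13.
CACHEERR_D_FLAGS = [
    (28, "DS"),   # data array single-bit ECC error (bit position assumed)
    (27, "DD"),   # data array double-bit ECC error
    (26, "E"),    # external cache error
    (25, "L"),    # error detected on a load access
]


def decode_cacheerr_d(value):
    """Return the names of the listed flags that are set in a CacheErr-D value."""
    return [name for bit, name in CACHEERR_D_FLAGS if (value >> bit) & 1]


print(decode_cacheerr_d((1 << 27) | (1 << 25)))
```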
TABLE 10-14 Validity of Idx, Way and PA Fields in DCache Errors


Error | Idx | Way | PA


The CacheErr-D and CacheErr-DPA always capture the first signaled cache error, and the information stored in
these registers is "locked" until software performs a write to the CacheErr-D register. If another cache error is 
detected before the lock is cleared, the M bit is set, indicating that multiple cache errors have been detected. Like 
the lock bit, the M bit is cleared by a write to the CacheErr-D register. 




When a data cache access ends in a bus error, the physical address of that request is stored in a separate register, 
BusErr-DPA (number 26, select 1). Like the CacheErr-D and CacheErr-DPA registers, this register contains the 
first detected error, and the lock for the register may be cleared by a write to that register. Multiple bus errors are 
also indicated, and the MB bit in the ErrCtl register serves this purpose. 
Table 10-15 provides a summary of error reporting registers.


TABLE 10-15 Error Reporting Registers


CHAPTER 11 The Performance Monitor Architecture


11.1 Introduction 


architecture specification, but with additional
architecture states include four pairs of count

events which are interesting to monitor for analyzing
implementation in the future.


11.2 Architecture State and Features


The performance monitor mechanism uses a total of 11 registers. The traditional counter control and data
registers are mapped to CP0 Register 25 with Select from 0 to 7. The added event control and address registers
are mapped under CP0 Register 22 with Select from 0 to 2. The mapping of the registers is shown in Table 11-1.
Their functionality and contents are described in the following sections.


TABLE 11-1 Performance Counter Register Mapping 


Register | Select    | Description
25       | 0x00      | Event control register 0
25       | 0x01      | Event counter register 0
25       | 0x02      | Event control register 1
25       | 0x03      | Event counter register 1
25       | 0x04      | Event control register 2
25       | 0x05      | Event counter register 2
25       | 0x06      | Event control register 3
25       | 0x07      | Event counter register 3
         | 0x08-0x0F | Reserved
22       | 0x10      | Event control register
22       | 0x11      | Event instruction address register
22       | 0x12      | Event memory address register
11.2.1 Event Counter and Control Registers (Register = 25, Select = 0x00, 0x01, 0x02, 0x03,
0x04, 0x05, 0x06, 0x07) 


SB1 implements four 41-bit count-up counters. They are readable and writable by software and are updated
implicitly by the hardware event specified in the corresponding control register. The MSB (bit 40) is the overflow
bit, which is set when there is a carry out of bit 39. It can be used to raise a counter overflow interrupt and to
freeze the update of all the counters and event registers. Forty bits are sufficient to count a 1-bit event for about
18 minutes of execution time at a frequency of 1 GHz.
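The 18 minute estimate follows from the counter width and the clock frequency. The calculation below is a worked sketch; the names are hypothetical, and the second call assumes, purely for illustration, an event that can count up to four per cycle.

```python
COUNT_BITS = 40                  # bits 39:0 hold the count; bit 40 is overflow
CLOCK_HZ = 1_000_000_000         # 1 GHz, as quoted in the text


def seconds_to_overflow(events_per_cycle=1, bits=COUNT_BITS, clock_hz=CLOCK_HZ):
    """Time until a carry out of the count field at a steady event rate."""
    return (1 << bits) / (events_per_cycle * clock_hz)


print(seconds_to_overflow() / 60)                     # about 18.3 minutes
print(seconds_to_overflow(events_per_cycle=4) / 60)   # about 4.6 minutes
```

Events that can increment the counter by more than one per cycle shorten the time to overflow proportionally.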


Each counter register is paired with a control register. The following figure shows the format of the counter
control register. Table 11-2 describes the control register fields. The control register is both readable and
writable by software.


31 | 30 | 29 ... 11 | 10 ... 5 | 4  | 3 | 2 | 1 | 0
M  | FE | 0         | Event    | IE | U | S | K | EXL


FIGURE 11-1 Performance Counter Control Register


TABLE 11-2 Performance Counter Control Register Field Description


Field Name | Bits  | Description | Reset State
M          | 31    | If this bit is 1, another pair of performance counter control and counter registers is implemented at n+2 and n+3. |
FE         | 30    | Freeze enable: Once a counter overflows, it freezes the other counters and registers, preventing further updates if this bit is set. |
0          | 29:11 | Must be set to 0. Return 0 on read. No exception is incurred if a 1 is written. |
Event      | 10:5  | ID of the event being monitored. Events are symmetrical to each counter. |
IE         | 4     | Overflow interrupt enable: Once a counter overflows, an overflow interrupt is triggered if this bit is set. |
U          | 3     | Enable event counting in User mode. | Undefined
S          | 2     | Enable event counting in Supervisor mode. |
K          | 1     | Enable event counting in Kernel mode. | Undefined
EXL        | 0     | Enable event counting when EXL bit is set. | Undefined


A counter is 41 bits wide. It is incremented each core cycle by the value of the input event specified in the
control register. The register format is shown in the following figure. When a counter generates a carry out of
bit 39, its overflow bit (bit 40) is set. In addition, a counter overflow interrupt is raised if the counter's IE bit is
set. When the counter's FE bit is set, the overflow freezes the update of all performance counter registers and
event registers. This helps retain the counter values precisely at the point of counter overflow. If the FE bit is
not set, the counter continues incrementing.
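The overflow, interrupt, and freeze behavior described above can be modeled in a few lines. This is a behavioral sketch, not a description of the hardware: the class and method names are hypothetical, and the model assumes that counters evaluated in the same cycle as the overflow still receive that cycle's update.

```python
COUNT_MASK = (1 << 40) - 1       # bits 39:0 hold the event count
OVERFLOW_BIT = 1 << 40           # bit 40 is the overflow bit


class Counter:
    def __init__(self, value=0, ie=False, fe=False):
        self.value = value       # 41-bit register image
        self.ie = ie             # overflow interrupt enable (control register)
        self.fe = fe             # freeze enable (control register)

    def add(self, events):
        """Add this cycle's event count; return True on a carry out of bit 39."""
        total = (self.value & COUNT_MASK) + events
        carried = total > COUNT_MASK
        self.value = (self.value & OVERFLOW_BIT) | (total & COUNT_MASK)
        if carried:
            self.value |= OVERFLOW_BIT
        return carried


class Monitor:
    def __init__(self, counters):
        self.counters = counters
        self.frozen = False

    def cycle(self, events_per_counter):
        """Advance one core cycle; return True if an overflow interrupt is raised."""
        if self.frozen:
            return False
        interrupt = False
        for counter, events in zip(self.counters, events_per_counter):
            if counter.add(events):
                interrupt = interrupt or counter.ie
                self.frozen = self.frozen or counter.fe
        return interrupt
```

With FE set on the overflowing counter, later calls to `cycle` leave every counter unchanged, which mirrors the intent of preserving the values at the point of overflow.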


39 0 


FIGURE 11-2 Performance Counter Register 


TABLE 11-3 Performance Counter Register Field Description 


Field Name  | Bits | Description | SW Read/Write | Reset State | Compliance
Event Count | 39:0 | Incremented by the input values of the specified event each CPU cycle. | R/W | Undefined | New


11.2.2 Counter Overflow Interrupt 


Once a counter's overflow bit is set, if the IE bit is set, hardware interrupt 5 should be raised and the interrupt
bit IP(7) should be set in the Cause register.


11.2.3 Event Control and Address Registers (Select = 0x10, 0x11, 0x12)


Instruction cache, data cache, and branch prediction performance are the most important factors in deciding 
overall machine performance. The interesting events for performance analysis are the instruction cache misses, 
data cache misses, and branch mispredictions. With detailed information about these events such as the address 
of the event and the status of the requested data, one can optimize the code to get around the potential 
performance problems. 


SB1 has a set of event address registers to assist performance analysis: one for specifying the detailed
conditions of the event to be captured, one for storing the event's data address, and one for storing the
instruction address of the operation that causes the event. SB1 captures the virtual address of the instruction. It
captures the physical address of the data, because virtual addresses are unavailable at the point of capture.


The formats of the registers are shown in the following figures. Table 11-4 lists the description of the control
register fields. The fields in the control register are used to qualify four major sources of events:


• Branch execution
• Instruction cache misses
• Data cache misses for loads
• Data cache misses for stores


The control register is 32 bits wide. Fields EXL, K, S, and U are common to all four sources of events.
They specify the execution modes in which the events can be captured. Fields Cc, Cw, Pt, Pn, Ot, On, Bc, Br, Bi,
Bu, Bc, and Bwi (bits 4 to 15) are used only to qualify the branch execution events. These fields are divided into
five groups of filters as follows:


• Prediction result filter (Cc and Cw): they are masks to select the branches with correct or incorrect prediction
results, which include taken/not-taken predictions and indirect target predictions. When both bits are set, all
the branches regardless of their prediction results are included. When only one bit is set, only the branches
with the selected result are included. When both bits are cleared, no branches are included.


• Prediction filter (Pn and Pt): they are masks to select the branches with taken or not-taken predictions. When
both bits are set, all the branches are included regardless of whether they are predicted taken or not-taken.
When only one bit is set, only the branches with the selected prediction are included. When both bits are
cleared, no branches are included.


• Outcome filter (Ot and On): they are masks to select the branches with taken or not-taken outcomes. When
both bits are set, all the branches are included regardless of whether their execution outcome is taken or
not-taken. When only one bit is set, only the branches with the selected outcome are included. When both bits
are cleared, no branches are included.


• Branch type filter (Bc, Br, Bi, Bu, and Bc): they are masks to select various types of branch instructions as
specified in the table.




• Instruction watch filter (Bwi): it indicates whether or not the branch event needs to be qualified with an
instruction watch register match.


The final qualification of a branch event is the AND of the execution mode qualification and the qualification
from each of the above five categories. With all these fields, we have a powerful filtering logic to select the
interesting sets of branch events for capturing. The logic expression of the filtering is shown as follows:


Branch Event Qualification = (Execution mode filter) & (Prediction result filter) & (Prediction filter) &
(Execution outcome filter) & (Branch type filter) & (Instruction watch match filter)


Note that unconditional, indirect, call, and return branches are always predicted taken, because their outcomes 
are always taken. In addition, since the predicted targets of unconditional branches are calculated, the prediction 
is always correct. 
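The branch qualification expression can be sketched as a predicate over one executed branch. The class names, field names, and the string labels for branch types and execution modes are hypothetical. The sketch also assumes that a cleared Bwi bit means the instruction watch match is not required.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    kind: str                 # "conditional", "unconditional", "indirect", "call", "return"
    predicted_taken: bool
    taken: bool
    prediction_correct: bool
    iwatch_match: bool        # instruction watch register matched this branch
    mode: str                 # "user", "supervisor", "kernel", or "exl"


@dataclass
class BranchFilter:
    modes: frozenset          # execution modes enabled by U, S, K, EXL
    cc: bool                  # select correct predictions
    cw: bool                  # select wrong predictions
    pt: bool                  # select taken predictions
    pn: bool                  # select not-taken predictions
    ot: bool                  # select taken outcomes
    on: bool                  # select not-taken outcomes
    kinds: frozenset          # branch types enabled by the branch type masks
    bwi: bool                 # require an instruction watch match


def branch_qualifies(f, b):
    """AND of the execution mode filter and the five branch filters."""
    mode_ok = b.mode in f.modes
    result_ok = f.cc if b.prediction_correct else f.cw
    prediction_ok = f.pt if b.predicted_taken else f.pn
    outcome_ok = f.ot if b.taken else f.on
    kind_ok = b.kind in f.kinds
    watch_ok = b.iwatch_match or not f.bwi
    return mode_ok and result_ok and prediction_ok and outcome_ok and kind_ok and watch_ok
```

For instance, setting only Cw together with all prediction and outcome masks and the conditional type mask captures mispredicted conditional branches in the enabled modes.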


The rest of the fields in the control register (W, R, I, Cwd, and Cwi) are used for qualifying cache miss events,
which include instruction cache misses, data cache misses for loads, and data cache misses for stores. They are
used in the following ways:


• Access type filter (W, R, and I): they are masks to select various types of cache accesses. When 'W' is set, the
store accesses to the data cache are included. When 'R' is set, the load accesses to the data cache are included.
When 'I' is set, the instruction cache miss accesses are included. With various bits being set, various types of
cache accesses are included.


• Data watch register match filter (Cwd): it indicates the cache access event needs to be qualified with the
match result of the data watch register. This filter is available only for data cache accesses for loads (when R is
set). It is a don't care for all the other types of events.


• Instruction watch register match filter (Cwi): it indicates the cache access event needs to be qualified with the
match result of the instruction watch register. This filter is available only for the instruction cache miss requests
and data cache misses for loads (when I or R is set). It is a don't care for data cache misses from stores.


The final qualification of a cache access event is the ANDing result of the execution modes and the qualification
from each of the above three categories. With all these fields, there is a powerful filtering logic to select the
intersecting sets of cache miss events to capture. The logic expression of the filtering of each access type is
shown below:


Cache Event Qualification = (Execution mode filter) & (Access type filter) &
(Data watch match filter) & (Instruction watch match filter)


Note that each cache access type should have independent qualification logic since they each use different filters.
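Because each access type uses a different subset of the filters, the cache qualification can be sketched with one branch per access type. The function name, parameter names, and access type labels are hypothetical; as in the text, Cwd applies only to loads and Cwi applies only to instruction fetches and loads.

```python
def cache_event_qualifies(access, mode_ok, w, r, i, cwd, cwi,
                          dwatch_match, iwatch_match):
    """Qualify one cache miss event; access is "store", "load", or "ifetch"."""
    if access == "store":
        # Cwd and Cwi are don't-cares for stores.
        return mode_ok and w
    if access == "load":
        return (mode_ok and r
                and (dwatch_match or not cwd)
                and (iwatch_match or not cwi))
    if access == "ifetch":
        # Only the instruction watch filter applies to instruction fetches.
        return mode_ok and i and (iwatch_match or not cwi)
    raise ValueError("unknown access type: " + access)
```

A store miss therefore qualifies whenever W is set and the mode matches, even if both watch filters are enabled and neither watch register matched.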


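31...21 | 20  | 19  | 18 | 17 | 16 | 15  | 14 | 13 | 12 | 11 | 10 | 9  | 8  | 7  | 6  | 5  | 4  | 3 | 2 | 1 | 0
0       | Cwi | Cwd | I  | R  | W  | Bwi | Bc | Bu | Bi | Br | Bc | On | Ot | Pn | Pt | Cw | Cc | U | S | K | EXL


FIGURE 11-3 Cache Event Control Register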


TABLE 11-4 Cache Event Control Register Field Description


Field Name | Bits  | Description | SW Read/Write | Reset State | Compliance
0          | 31:21 | Write is ignored. Return 0 on read. No exception is incurred if a 1 is written. | | | New
Cwi        | 20    | Cache event capturing should be qualified by the match result of instruction watch register. | R/W | Undefined | New
Cwd        | 19    | Cache event capturing should be qualified by the match result of data watch register. | R/W | Undefined | New
I          | 18    | Select instruction fetch to be captured. | R/W | Undefined | New
R          | 17    | Select data reads to be captured (for loads). | R/W | Undefined | New
W          | 16    | Select data writes to be captured (for stores). | R/W | Undefined | New
Bwi        | 15    | Branch event capturing should be qualified by the match result of instruction watch register. | R/W | Undefined | New
Bc         | 14    | Conditional branch mask: this bit selects all the conditional branches. | R/W | Undefined | New
Bu         | 13    | Unconditional branch mask: this bit selects all the unconditional jumps. | R/W | Undefined | New
Bi         | 12    | Indirect branch mask: this bit selects all the indirect branches. | R/W | Undefined | New
Br         | 11    | Procedure return mask: this bit selects all the procedure returns. | R/W | Undefined | New
Bc         | 10    | Procedure call mask: this bit selects all the procedure calls. | R/W | Undefined | New
On         | 9     | Not-Taken outcome mask: this bit selects the branches with not-taken outcome. | R/W | Undefined | New
Ot         | 8     | Taken outcome mask: this bit selects the branches with taken outcome. | R/W | Undefined | New
Pn         | 7     | Not-Taken prediction mask: this bit selects the branches with not-taken predictions. | R/W | Undefined | New
Pt         | 6     | Taken prediction mask: this bit selects the branches with taken predictions. | R/W | Undefined | New
Cw         | 5     | Wrong prediction mask: this bit selects the branches with wrong predictions. | R/W | Undefined | New
Cc         | 4     | Correct prediction mask: this bit selects the branches with correct predictions. | R/W | Undefined | New
U          | 3     | Enable event capturing in User mode. | R/W | Undefined |
S          | 2     | Enable event capturing in Supervisor mode. | R/W | Undefined |
K          | 1     | Enable event capturing in Kernel mode. | R/W | Undefined |
EXL        | 0     | Enable event capturing when EXL bit is set. | R/W | Undefined |


Each of the two event address registers is 64 bits wide. The instruction address register stores the virtual
instruction address of either the load that causes a qualified data cache event or the branch that causes a qualified
branch event. For instruction cache miss events and data cache miss events from stores, the instruction address
register does not latch the instruction addresses, because the addresses are not available when the event is
captured. Since each instruction is 32 bits wide, the lowest two address bits are always 0. Therefore, the lowest
two bits in the instruction address register are used to indicate the type of instruction address being captured, as
shown in Table 11-5.


The format of the instruction address register is shown in the following figure. The description of the fields can
be seen in Table 11-5.


FIGURE 11-4 Event Instruction Address Register 


TABLE 11-5 Event Instruction Address Register Field Description 


Field Name      | Bits | Description | SW Read/Write | Reset State | Compliance
Virtual Address | 63:2 | The virtual instruction address of the instruction that is qualified for the branch events or data cache events. | R/W | Undefined | New
Type            | 1:0  | Type of address captured: 00 is for not-taken branch, 01 is for taken branch, 10 is for data load, and 11 is reserved. | | Undefined |
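Software that reads the event instruction address register has to separate the Type field from the address. A minimal sketch of that split, using the encoding given in Table 11-5, is shown below; the function name and the sample value are hypothetical.

```python
TYPE_NAMES = {
    0b00: "not-taken branch",
    0b01: "taken branch",
    0b10: "data load",
    0b11: "reserved",
}


def decode_event_iaddr(reg):
    """Split a 64-bit register value into (virtual address, type name)."""
    reg &= (1 << 64) - 1
    address = reg & ~0b11        # bits 63:2; the low two address bits are always 0
    return address, TYPE_NAMES[reg & 0b11]


address, kind = decode_event_iaddr(0xFFFFFFFF80001235)
print(hex(address), kind)
```

The sample value decodes to a taken branch at address 0xFFFFFFFF80001234.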
The data address register, on the other hand, stores either the missing data line address or the missing
instruction line address, because all the data cache read/write accesses and instruction misses must go through
the same pipeline before they are sent to the bus. Because of the timing of the event capturing, we store the
physical address of the data/instruction line being referenced, instead of its virtual address. Since the physical
line address is stored, the lowest five address bits are always 0. The physical address takes up bits 5 to 39 in the
register. The higher order bits are used to indicate the status of the captured event. The format of the register is
shown in Table 11-6.


Bits I, R, and W indicate the type of the event as instruction cache miss, data cache miss from load, and data 
cache miss from store respectively. When a miss event is first captured, the pending bit is set until the data 
returns. Once the data actually returns, the pending bit is cleared and the source of the data return is recorded in 
the status bits as follows: 


• Dirty (D) bit: indicates the returned data is dirty
• Other (O) bit: indicates the data is returned by bus agents other than processors, secondary cache, and
main memory
• Main memory (M) bit: indicates the data is returned by main memory
• Secondary Cache (C) bit: indicates the data is returned by the secondary cache
• Processor (P) bit: indicates the data is returned by other processors


63 62 61 60 59 58 57 56 55 54 40 39 5 4 0


FIGURE 11-5 Event Data Address Register


TABLE 11-6 Event Data Address Register Field Description 


Field Name | Bits | Description | SW Read/Write | Reset State | Compliance
Pending    |      | Cache miss event captured is still pending for the return. | | Undefined | New
I          | 62   | Captured event is an instruction cache miss. | | Undefined | New
R          | 61   | Captured event is a data cache access from a load. | | Undefined | New
C          | 58   | Captured request is serviced by the secondary cache. | R | |
O          | 56   | Captured request is serviced by agents other than the processors, secondary cache, and main memory. | | Undefined | New
           | 39:5 | The memory physical address of data cache miss event or instruction fetch miss event. | | Undefined | New
           | 4:0  | Reserved. Write is ignored. Read returns 0. | | |
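Extracting the line address from the event data address register amounts to masking bits 39:5. The sketch below also tests the status bits whose positions are legible in Table 11-6 (I at 62, R at 61, C at 58, O at 56); the remaining status bits are omitted because their positions are not stated here. The names and the sample value are hypothetical.

```python
PA_MASK = ((1 << 40) - 1) & ~0x1F            # bits 39:5, the physical line address
STATUS_BITS = {"I": 62, "R": 61, "C": 58, "O": 56}


def decode_event_daddr(reg):
    """Return (physical line address, set of status flag names that are set)."""
    flags = {name for name, bit in STATUS_BITS.items() if (reg >> bit) & 1}
    return reg & PA_MASK, flags


sample = (1 << 61) | (1 << 58) | 0x12345678E0
print(decode_event_daddr(sample))
```

The sample describes a load miss that was serviced by the secondary cache for the line at physical address 0x12345678E0.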
11.3 Performance Events


The following table lists the performance events being designed in SB1. The 'event description' field shows the
meaning of the event being generated. The 'count description' field shows what the intended count is. The 'type'
field shows how an event is counted in the machine pipeline. It could be counted speculatively (Spec),
non-speculatively (past the branch validation) (Non-S), or post graduation (Grd). The 'Max' field shows the
maximum count of an event in one cycle. The 'Src' field shows the generator of the event signal. Event IDs
have not been assigned yet; they will be assigned later based on the physical implementation. The total number
of events is currently 50.


TABLE 11-7 Instruction Count

Event Description | Count Description | Type | Max | Src
Clock is high | # of cycles |  |  |
a BR instruction executed | # of BR instructions executed | Non-S | 4 |
# of LD instructions executed | total # of LD instructions executed (after speculative point) | Non-S | 1 |
# of ST instructions executed | total # of ST instructions executed (after speculative point) | Non-S | 2 | PE
a CP0 instruction executed | # of CP0 instructions executed (after speculative point) | Non-S | 2 |
# of FLOPS executed | # of FLOPS executed | Non-S | 1 | PC
# of MOPS executed | # of MOPS executed | Non-S | 8 | PC
a store conditional executed | # of store conditionals executed (after speculative point) | Grd | 2 | PC
a successful store conditional executed | # of successful store conditionals executed (after speculative point) | Grd |  | PC
a fetch results in I-Cache miss | # of fetch results in I-Cache miss | Spec |  |
 | # of read hits in DCFIFO | Spec |  |
a valid I-Cache fill | # of valid I-Cache fill | Spec |  |
# of valid entries in read queue | average life time of request in read queue | Spec | 8 |
# of valid uncached entries in read queue | average life time of uncached request in read queue | Spec |  |
a request hits in write buffer | # of request hit in write buffer | Non-S |  |
a writeback occurs due to replacement | # of write-backs due to replacement | Non-S |  |
Interface
a bus request is sent to ZB bus | # of bus requests | Non-S |  |
a bus read request is sent to ZB bus | # of bus read requests | Non-S | 1 |
a bus write request is sent to ZB bus | # of bus write requests | Non-S | 1 |
BIU stalls due to address bus busy | # of stall cycles for waiting for address bus | Non-S |  |
a snoop hits in write buffer | # of snoop hits in write buffer | Non-S |  |
a shared snoop hits on a shared line | # of read shared snoop hits on a shared line (no action) | Non-S |  |
a shared snoop hits on an exclusive line | # of read shared snoop hits on an exclusive line (intervention shared) | Non-S | 1 |
an exclusive snoop hits on shared line | # of read exclusive snoop hits on shared line (invalidate) | Non-S | 1 |
an exclusive snoop hits on exclusive line | # of read exclusive snoop hits on exclusive line (intervention exclusive) | Non-S | 1 |
an invalidate snoop hits on shared line | # of invalidate snoop hits on shared line (invalidate or write-invalidate) | Non-S |  |
an invalidate snoop hits on exclusive line | # of invalidate snoop hits on exclusive line (invalidate or write-invalidate) | Non-S |  |
snoop address queue is full | # of cycles when snoop address queue is full | Non-S |  |

TABLE 11-8 Microarchitectural Events

Event # | Event Description | Count Description | Type | Max
 | taken branch bubble visible (a bubble reaches issue when there is no valid instruction to issue) | taken branch bubble visible (a bubble reaches issue when there is no valid instruction to issue) |  |
 | instruction cache miss bubble (a bubble reaches issue when there is no valid instruction to issue) | instruction cache miss bubble (a bubble reaches issue when there is no valid instruction to issue) |  |
 | # of instructions issued | # of instructions issued | Spec | 4
 | No valid instructions available for issuing | # of cycles when no valid instructions are available for issuing | Spec | 4
 | Valid instructions are available for issuing but stopped by resource constraints | # of cycles when valid instructions are available for issuing but stopped by resource constraints | Spec | 4
 | Valid instructions are available for issuing but stopped by dependency constraints | # of cycles when valid instructions are available for issuing but stopped by dependency constraints | Spec | 4
 | Issue stopped by width limit | # of cycles when maximum issue is achieved |  |
Replay and Misprediction
 | a replay is signaled (data dependency, RQ full, DCFIFO, fill) | # of replays signaled (data dependency, RQ full, DCFIFO, fill) |  |
 | replay caused by DCFIFO signaled | # of replays caused by DCFIFO | Non-S |
 | Branch prediction/execution ... |  |  |


11.4 Pending Issues 


• The assignment of the events to each counter with Mux minimization in mind
• The assignment of only a subset of events to the first counter for denominators
• Events
CHAPTER 12 Multiprocessing Support


12.1 Introduction 


• Load Linked-Store Conditional (LL-SC)
• Load Linked Double-Store Conditional Double (LLD-SCD)


These instructions provide a fast and simple alternative to the Dekker or Peterson algorithms for mutual exclu- 
sion. When used properly, these instructions provide support for an atomic read-modify-write sequence, upon 
which standard mutual exclusion mechanisms can be easily built. 


The common usage for a busy-wait memory lock is as follows: 


1. SB-1250 is the first Multiprocessing System On a Chip (SOC) product that utilizes the SB-1 core. 
Register $1 contains the address of a memory lock/semaphore, where 1 represents a locked state and 0 a clear
state.


            .align 32
            .set noreorder
TryAgain:   LL      $2, 0($1)           # get the lock
            BNE     $2, $0, TryAgain    # if lock==1, spin
            ADDIU   $2, $0, 1           # lock==0, so $2=1 (delay slot)
            SC      $2, 0($1)           # lock=1
            BEQ     $2, $0, TryAgain    # if r-m-w fails ($2==0), spin
            NOP

            ==== critical section =====
            ADD     $2, $0, $0          # $2=0
            SW      $2, 0($1)           # lock=0


If a 1 is written to $2 by the Store Conditional (SC), it indicates that the SC successfully updated the architectural 
view of memory location ($1). Otherwise, a 0 is written to $2. The memory lock must exist in cached coherent 
memory space, otherwise the results are UNPREDICTABLE. 


Although not necessary for correct behavior, aligning the LL-SC sequence into the same 32 bytes can reduce 
spin time by ensuring that an Icache miss never occurs between the Load Linked and the Store Conditional. 


There are several events which will cause the failure of Store Conditional instruction if they occur between a 
Load Linked and Store Conditional instruction pair. These include the following: 


• Completion of a coherent memory access to the same 32-byte aligned block of memory by another processor,
• The occurrence of an exception on the processor executing the LL/SC instruction pair,
• A line fill which forces the locked line out of the cache.


In addition, the results of the Store Conditional are UNPREDICTABLE if: 


• The Store Conditional instruction is not preceded by a Load Linked instruction.
• The Store Conditional instruction is preceded by a Load Linked instruction to a different physical or virtual
address.


Any of the above conditions may cause a Store Conditional to indicate success without actually guaranteeing 
atomic access to the memory block in question. 


The Load Linked-Store Conditional pairings do not explicitly guarantee fairness, only mutual exclusion.


12.3 Processor Synchronization


The following sections elaborate on processor synchronization schemes available for SB-1 multiprocessing. 


12.3.1 Test and Set 


For an example of how this synchronization method operates in SB-1, refer to Section 12.2. 


12.3.2 Counter Based Synchronization 
The common usage for a counting semaphore is shown below. 


The memory lock contains the number of processes allowed in the critical section, and its address is specified by 
register 1 ($1) below: 


            .set noat
            .set noreorder
            .align 32
TryAgain1:  LL      $2, 0($1)           # Get memory lock
            BEQ     $2, $0, TryAgain1   # Check for non-zero result
            DADDIU  $2, $2, -1          # Decrement Semaphore (delay slot)
            SC      $2, 0($1)           # Attempt to store
            BEQ     $2, $0, TryAgain1   # If failed, loop back
            NOP
            === Critical Section ===
TryAgain2:  LL      $2, 0($1)           # Get lock again
            DADDIU  $2, $2, 1           # Increment Semaphore
            SC      $2, 0($1)           # Attempt store
            BEQ     $2, $0, TryAgain2   # If failed, loop back
            NOP
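
The same counting-semaphore pattern can be sketched in C11. The function names are hypothetical, and the sketch assumes a compiler that implements the atomic operations with LL/SC-style retry loops; it is an illustration of the technique, not a replacement for the listing above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* *sem holds the number of processes still allowed into the critical
   section, as in the text. Function names are hypothetical. */
static void csem_enter(atomic_int *sem)
{
    assert(sem != NULL);
    int n = atomic_load_explicit(sem, memory_order_relaxed);
    for (;;) {
        if (n == 0) {
            /* No slot available: reload and spin (mirrors the first BEQ). */
            n = atomic_load_explicit(sem, memory_order_relaxed);
            continue;
        }
        /* Try to publish n - 1; on failure n is refreshed with the
           current value and the loop retries (mirrors SC followed by BEQ). */
        if (atomic_compare_exchange_weak_explicit(sem, &n, n - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return;
        }
    }
}

static void csem_leave(atomic_int *sem)
{
    assert(sem != NULL);
    /* Give the slot back (mirrors the TryAgain2 increment loop). */
    atomic_fetch_add_explicit(sem, 1, memory_order_release);
}
```

Initializing the semaphore to N admits at most N concurrent holders; an initial value of 1 reduces it to the binary lock of the previous section.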


12.4 Coherency 


The following sections provide an overview of the supported memory model and cache organization in systems 
that use the SB-1 processor. 


12.4.1 Memory Model 


The SB-1 supports a weakly ordered memory model. This is an important consideration when using non-atomic 
multi-programming techniques such as producer-consumer structures and shared memory states. 
The following rules apply to external visibility of memory accesses in either a cached coherent or cached
noncoherent region of memory:

• The order between a load and a store on the same processor is not guaranteed, and
• The order between multiple loads is not guaranteed.
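
Because of these rules, a producer-consumer handoff must state its ordering requirements explicitly rather than rely on program order. The following C11 sketch shows one way to do that with release and acquire operations. The structure and function names are hypothetical, and the assumption that a compiler for a MIPS target implements these orderings with SYNC instructions is typical but not taken from this document.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared structure: one payload word guarded by a flag. */
struct mailbox {
    int payload;
    atomic_bool ready;
};

static void mailbox_produce(struct mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    /* Release: the payload store is made visible before the flag store. */
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

static bool mailbox_try_consume(struct mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    /* Acquire: the payload load below cannot be performed ahead of
       this flag load, which a weakly ordered machine could otherwise do. */
    if (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        return false;
    }
    *out = m->payload;
    return true;
}
```

Without the acquire on the consumer side, the second rule above (no guaranteed order between multiple loads) would allow the payload to be read before the flag, returning stale data.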


12.5 Cache Organization and Coherency in SB-1 


To ensure code efficiency and correctness, the following need to be considered with regard to the cache 
organization in SB-1. 


12.5.1 Instruction Stream Modifications 


The SB-1 has split instruction and data caches. Any program that requires modification to its own instruction
stream must obey the following sequence of events and must guarantee that all these events occur in the order
shown on a single processor before attempting to execute the new code:

1. Store the new instructions
2. Flush the L1 data cache
3. Flush the L1 instruction cache
4. Execute a SYNC instruction


In addition, all other processors in the system must be forced to flush their instruction caches and sync before 
attempting to execute the new code. Failure to do so may result in the execution of older cached code. 
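
In user-level C code the sequence above is normally delegated to the toolchain or operating system rather than written by hand. The sketch below uses the GCC/Clang builtin `__builtin___clear_cache`, which asks the platform to make a freshly written address range safe to fetch as instructions. It is an illustration under assumptions: the helper name is hypothetical, the destination is assumed to be writable memory supplied by the caller, and whether the builtin also covers the instruction caches of other processors depends on the platform and is not established by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies len bytes of new instructions to dst and
   requests the cache maintenance needed before they are executed. */
static void install_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* Step 1: store the new instructions. */
    memcpy(dst, src, len);

    /* Steps 2 to 4: GCC/Clang builtin that requests data cache
       write-back and instruction cache invalidation for the range. */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

The sketch only performs the store and the cache maintenance; actually branching to `dst` additionally requires that the memory be mapped executable.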


12.5.2 Caching Attributes 
The SB-1 implements 4 different caching attributes at page granularity (refer to Chapter 6 for more detail): 


• Cached coherent,
• Cached noncoherent,
• Uncached, and
• Uncached accelerated


For cached coherent pages, a snoop-based protocol can be implemented by the encompassing system to maintain 
coherency across the agents. This protocol is highly efficient and, as such, manual optimization of the cached 
state of highly-volatile exclusive regions is not recommended (refer to SB-1250 Users Manual for specific 
details for this system level implementation.) 
Cached noncoherent regions allow memory regions outside the range of the coherence protocol to be cached to
improve performance. For this class of regions, such as a bus device with memory-mapped configuration data, 
the operating system is responsible for ensuring that memory writes to the region are coherent across processors. 
As a minor optimization, memory pages containing only instructions may also be placed in cached noncoherent 
regions. 


Multiprocessor memory contention should generally be avoided in all other modes. 


12.6 Processor Bringup 


After system reset, processor 1 in SB-1250 is held in reset mode to allow critical system initializations to occur in a
uniprocessor environment under processor 0. After system initialization is complete, processor 0 may "release"
processor 1 by writing a 0 to bit 1 of the system_cfg register.
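
The release step can be sketched in C as a read-modify-write that clears only bit 1. The register's address and access width are not given in this document, so the sketch takes a pointer supplied by the caller and assumes a 64-bit access; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 1 of system_cfg holds processor 1 in reset (per the text).
   The macro name is hypothetical. */
#define SYSTEM_CFG_CPU1_RESET_BIT 1u

/* system_cfg: pointer to the memory-mapped register; its address and
   64-bit width are assumptions of this sketch. */
static void release_processor1(volatile uint64_t *system_cfg)
{
    assert(system_cfg != NULL);
    uint64_t v = *system_cfg;                             /* read current value */
    v &= ~(UINT64_C(1) << SYSTEM_CFG_CPU1_RESET_BIT);     /* clear bit 1 only */
    *system_cfg = v;                                      /* write it back */
}
```

Preserving the other bits matters because system_cfg is shared configuration state; the SB-1250 User Manual remains the authority on its layout.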


Processor 1, when "released," begins executing at the normal reset vector. Typically the reset vector code 
includes a branch based on a read of the PRId register. 


Refer to SB-1250 User Manual for further documentation on supported MP, L2, and memory coherency 
protocols. 
CHAPTER 13 SB-1 Implementation Specific Details


13.1 Introduction 


This chapter clarifies SB-1 implementation specific 
in the MIPS64 Manual are referenced and clarifies 
Manual followed by SB-1 implementation spe, 


In particular, implementation-dependent comments


Instructions 


• P 4: SB-1 implements Reverse Endianness.
• P 13: SB-1 does not implement CP2.
• P 15: SB-1 implements DERET and SDBBP as the only EJTAG instructions.
• P 41: Refer to Prefetch Description in Chapter 6 for supported prefetch hint bits in SB-1.
• P 44: Refer to Prefetch Description in Chapter 6 for supported prefetch hint bits in SB-1.
• P 46: For those Cache Operations that require an index, no translation of the effective address occurs in SB-1.
• P 47: An Address Error Exception (with cause code equal to AdEL) may occur if the effective address
references a portion of the kernel address space which would normally result in such an exception.
• P 47: A data watch is not triggered by a cache instruction whose address matches the Watch register address
match conditions.
• P 48: DataLo and DataHi registers are not implemented in SB-1.
• P 49: Code 011 is not implemented in SB-1.
• P 50: “Fetch and Lock” Cache Op is not implemented in SB-1.
• P 53: SB-1 includes the standard TLB MMU.
• P 54-59: For all TLB instructions (TLBR, TLBWI, TLBWR), no masking is involved in the VPN2 and PFN
fields of EntryHi, EntryLo0 and EntryLo1 registers. All bits are preserved after a TLB entry is written and
then read.
• P 60: SB-1 does not implement the WAIT instruction; it is treated as a no-op.
• P 61: The format of FIR register is described in Chapter 4 of this document.
• P 62: SB-1 will flush all denormals to zero if flush to zero is enabled. It will also flush all underflow results
to zero. If flush to zero is disabled, the SB-1 will cause an unimplemented operation exception for denormal
inputs and underflowing results for arithmetic operations.


13.3 Clarifications on Implementation-Dependent Privileged 
Instructions 


• P 70: SB-1 does not implement CP0 Reg22.
• P 73: In SB-1, SEGBITS = 44 and PABITS = 40.
• P 76: SB-1 implements 64-bit addressing.
• P 76: SB-1 implements Supervisor Mode.
• P 79: Refer to the next two bullets for implementation-dependent behavior when StatusERL = 1.
• P 83: For the kuseg segment when StatusERL = 1, the lower 2^31-byte segment of kuseg is treated as an unmapped
uncached segment. For 64-bit addressing mode, when the UX bit is set in the CP0 Status register, the following
address translation occurs for the range of addresses between 2^31 and 2^44: bits 39 to 32 of the translated PA
are all zeros, and bits 31 to 0 of the translated PA are the same as the corresponding bits of the virtual
address; the cache attribute is that of uncached type.
• P 85: Refer to Chapter 7 for the TLB format in SB-1.
• P 94: The ErrorEPC register is loaded with PC-4 if the state of the processor indicates that it was executing
an instruction in the delay slot of a branch. Otherwise, the ErrorEPC register is loaded with PC. Note that this
value may or may not be predictable if the Reset Exception was taken as the result of power being applied to
the processor because PC may not have a valid value in that case. In SB-1, the value loaded into ErrorEPC
register is not predictable on a Reset.
• P 94: Soft Reset exception is not implemented in SB-1.
• P 95: NMI exception is implemented in SB-1.
• P 96: Machine check exception is implemented in SB-1 for TLB/Time out.
• P 96: In SB-1, detection of multiple matching entries in the TLB occurs on the TLB write that creates
multiple matching entries.
• P 99: In SB-1, a cache error exception resulting from an access to the data cache is generally reported
imprecisely with respect to the instruction that caused the cache error.
• P 100: In SB-1, a data bus error exception is reported imprecisely with respect to the instruction that caused
the bus error.
• P 102: From the MIPS64 Manual: “Some implementations of previous ISAs reported this case as a Floating
Point Exception, setting the Unimplemented Operation bit in the Cause field of the FCSR register.” SB-1
does not do this.
• P 103: If the EXL or ERL bits are one in the Status register and a single instruction generates both a watch
exception (which is deferred by the state of the EXL and ERL bits) and a lower-priority exception, the lower
priority exception is taken. In SB-1, the WP bit is not set in this case.
• P 103: In SB-1, a data watch exception is not triggered by a prefetch or cache instruction whose address
matches the Watch register address match conditions.
• P 105: In SB-1, the width of the index field matches the size of the TLB.
• P 106: The random CP0 register is incremented by one for each cycle that has more than zero instruction(s)
graduated, except for the cycles that have a TLBWI or TLBWR graduated. For every 3rd of such cycles, the
random CP0 register is not incremented.
• P 108: Refer to Chapter 6 for a full listing of SB-1 implemented cache coherency attributes.
• P 112: SB-1 implements 4K, 16K, 64K, 256K, 1M, 4M, 16M, and 64M¹ page sizes.
• P 114: The Count register acts as a timer, incrementing at a constant rate, whether or not an instruction is
executed, retired, or any forward progress is made through the pipeline. For SB-1, the rate at which the counter
increments is once per cycle.


1. 64M support will be in Pass2 of SB-1. 




• P 116: When the value of the Count register equals the value of the Compare register, an interrupt request is
ORed with hardware interrupt 5 to set interrupt bit IP(7) in the Cause register. This causes an interrupt as
soon as the interrupt is enabled.
• P 117: In SB-1, the RP bit is not implemented.
• P 119: The TS bit indicates that the TLB has detected a match on multiple entries. In SB-1, this detection
occurs on a write to the TLB.
• P 120: Supervisor Mode is implemented.
• P 128: For a description of PRId Register format in SB-1, refer to Chapter 8 in this document.
• P 132: For a description of LLAddr Register format in SB-1, refer to Chapter 8 in this document.
• P 133: SB-1 provides two pairs of WatchLo and WatchHi registers, referencing them via the select field of the
MTC0/MFC0 and DMTC0/DMFC0 instructions. Refer to Chapter 9 in this document.
• P 133: In SB-1, a data watch is not triggered by a prefetch or a cache instruction whose address matches the
Watch register address match conditions.
• P 137: For a list of performance counters implemented in SB-1, refer to Chapter 11 in this document.
• P 140: For the exact format and operation of the ErrCtl register, refer to Chapter 10 in this document.
• P 140: For the exact format and operation of the CacheErr register, refer to Chapter 10 in this document.
• P 142: For the exact format of the TagLo and TagHi registers, refer to Chapter 6 in this document.
• P 145: For a list of CP0 Hazards in SB-1, refer to Chapter 8.